Less-discussed challenges in a real-world machine learning project

Pallavi Satsangi
4 min read · May 15, 2020
Photo by Kevin Ku on Unsplash

Working on a machine learning project is not just about model creation; it also means focusing on the other aspects of the project, like deployment, extensibility, and modularity. Yes, we create models, yes, we train them, and yes, we bank on model accuracy for predictions. But this is not the entire life cycle.

One needs to think about:

→ How will the data be collected and updated on a regular basis? (This comes once you have identified the data attributes needed for model training.)

→ Python or R?

→ Would you need upstream or downstream data, or both?

→ Where will the model be deployed: on the client or on the server side?

→ How many instances of the model should be deployed?

→ What about load balancing?

In my experience, the development activities in an ML project fall into the three phases below:

1. Data Collection and Cleaning

2. Model Creation

3. Model Deployment

Phase 1: Data Collection & Cleaning

Data Collection:

In a real-world scenario you start with a problem statement, and the biggest challenge is that your data and its attributes are scattered: they are captured in various formats via different channels. One major task is to go through these attributes and decide which to use for model training and which to ignore. In practice it is nearly impossible to find that one team mate who knows all about the data and is willing to share it all with you. Usually the knowledge of the data is spread across different team members working on different modules, so be prepared to reach out to different folks to gather all the information you need.

→ Check the data sources (DB tables, text files, Excel sheets, JSON).

→ Go through the attributes/columns.

→ Check with other team members/DBAs/SMEs on your project as to what these attributes signify.

→ Collect the required attributes into a single format, like an Excel, CSV, or text file (a sketch of this consolidation follows this list).

→ Prepare a strategy/framework for how you plan to re-collect the same attributes if you later need to improve the accuracy of the model.
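For illustration, here is a minimal sketch of the consolidation step using pandas and SQLite. Every file, table, and column name below is a hypothetical placeholder; substitute whatever sources your project actually has.

```python
# A sketch of consolidating scattered attributes into a single CSV.
# All file, table, and column names are hypothetical placeholders.
import sqlite3

import pandas as pd

# Source 1: a database table
conn = sqlite3.connect("project.db")
orders = pd.read_sql("SELECT customer_id, order_total FROM orders", conn)

# Source 2: an Excel sheet maintained by another team
profiles = pd.read_excel("customer_profiles.xlsx")   # customer_id, segment

# Source 3: a JSON export from an upstream service
activity = pd.read_json("activity_log.json")         # customer_id, last_login

# Join on the shared key and keep only the attributes chosen for training
merged = orders.merge(profiles, on="customer_id").merge(activity, on="customer_id")
merged.to_csv("training_data.csv", index=False)
```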

Data Cleaning:

After data collection comes data cleaning, imputation, and manipulation, all of which are needed for model training.

In a real-world scenario, your column names will have spaces, special characters, etc. Your data, if it is in text format, will usually be extremely messy and unstructured, with stray spaces, unnecessary quotes, Unicode characters, and spelling mistakes. One really needs to spend a significant amount of time correcting these. E.g., I would need the values in a column to be “be_smart”, but my data would also contain “be-smart” in the same column. Thankfully, Python has enough packages and methods to clean and manipulate such data.

My experience: spend as much time as possible collecting and cleaning your data; the cleaner the data, the better your model will perform. (A minimal pandas sketch follows the list below.)

→ Remove nulls

→ Remove spaces

→ Remove duplicates

→ Handle outliers

→ Handle missing values

→ Encode categorical variables

→ Use regular expressions to normalize messy text
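As a sketch of these steps, assuming the consolidated CSV from earlier and hypothetical column names:

```python
# A sketch of common cleaning steps on a pandas DataFrame.
# Column names ("segment", "order_total") are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("training_data.csv")

# Clean column names: strip spaces and special characters
df.columns = (df.columns.str.strip().str.lower()
              .str.replace(r"[^a-z0-9]+", "_", regex=True))

# Remove duplicates; rows without the key are unusable
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id"])

# Normalize inconsistent text values, e.g. "be-smart" -> "be_smart"
df["segment"] = df["segment"].str.strip().str.replace("-", "_")

# Impute remaining missing numeric values with the median
df["order_total"] = df["order_total"].fillna(df["order_total"].median())

# Clip outliers to the 1st-99th percentile range
low, high = df["order_total"].quantile([0.01, 0.99])
df["order_total"] = df["order_total"].clip(low, high)

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=["segment"])
```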

Phase 2: Model Creation

You can create the model once you have selected the variables based on your requirements, using the cleaned data for training and testing. Here knowledge of analytics, neural networks, deep learning, and statistics comes into the picture, depending on your requirement. One needs to know when to use logistic regression over a decision tree, or when to use an SVM over a random forest. I usually run 2–3 algorithms and see which one gives the best result/accuracy for my data. Once your model is complete, it needs to be thoroughly tested.

→ Pick an algorithm based on your problem statement

→ Try a couple of other associated algorithms alongside it

→ Pick the one with the best accuracy

I go about it with something like this: create models with different classification algorithms, e.g.:

→ RandomForestClassifier

→ LinearSVC

→ LogisticRegression

→ MultinomialNB

Then select the one from the above that gives the best accuracy; a minimal scikit-learn sketch follows.
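Here is a minimal sketch of that comparison, assuming scikit-learn; the 20 newsgroups text data and TF-IDF features are stand-ins for your own cleaned dataset, and cross-validated accuracy is used instead of a single train/test split for a fairer comparison.

```python
# A sketch of "try a few classifiers, keep the best" with scikit-learn.
# The 20 newsgroups data is a stand-in for your own cleaned dataset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(stop_words="english").fit_transform(data.data)
y = data.target

models = [
    RandomForestClassifier(n_estimators=100),
    LinearSVC(),
    LogisticRegression(max_iter=1000),
    MultinomialNB(),
]

# 5-fold cross-validated accuracy is a fairer comparison than a single split
for model in models:
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{model.__class__.__name__}: {score:.3f}")
```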

Phase 3: Model Deployment

Another challenge that I have found while working on a machine learning project is deploying the model to a production environment. In a Jupyter notebook, testing by calling the predict function with a simple input is quite easy. But in production, how do you plan to expose your model? Via a web service, bundled into the client, or as a scheduled job? You either deploy your model from scratch (my less preferred way) or use available platforms like IBM Watson, Azure, or customised deployment platforms for ML. There are times when your client might have their own platform and you might need to deploy on it.

I initially deployed my model using Flask and Docker, and was successful in doing so, but then I encountered the issues below (a minimal sketch of the Flask route follows the list):

→ How many instances of the model should I create?

→ Should the model be exposed as a web service, or bundled and placed on the client (for mobile apps)?

→ How many Docker instances might be needed?

→ How would load balancing take place?
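For reference, here is a minimal sketch of the Flask route I started with; the model file name and the request format are hypothetical, and this sketch deliberately sidesteps the scaling and load-balancing questions above.

```python
# A sketch of exposing a trained model as a web service with Flask.
# "model.pkl" and the request format are hypothetical placeholders.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# A model saved earlier, e.g. with pickle.dump(model, open("model.pkl", "wb"))
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()               # e.g. {"features": [[1.2, 3.4]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A single container running this service answers none of the instance-count or load-balancing questions by itself, which is exactly why I eventually moved to a managed platform.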

To avoid getting into such issues, I switched to deploying my model on the deployment platform provided by my client. It was far easier than deploying the model from scratch. One needs to make sure the versions of the Python packages being used are compatible with the platform, and to get a hang of the steps needed to deploy the model. Beyond that, you don't need to worry about anything other than making sure your model runs efficiently and effectively, and, in case you need to train the model further, that the data is available to it via the platform.


Pallavi Satsangi

Project Manager | Machine Learning | Data Science | Natural Language Processing | Neural Networks | MSc. Business Analytics