Data Preprocessing

Pallavi Satsangi
6 min read · May 6, 2022

Data Preprocessing is one of the key steps in a Data Science project’s life cycle.


Data Preprocessing is a broad area and consists of several different strategies and techniques that are interrelated. I am capturing some of the most important ones that I have encountered in my projects.

1. Aggregation

2. Sampling

3. Dimensionality Reduction

4. Feature Creation

5. Discretization and Binarization

6. Variable Transformation

1. Aggregation:

Aggregation is the combining of two or more objects into a single object. Aggregation is extremely useful during the initial exploratory analysis of the data.

Uses Of Aggregation:

§ Could be used when you want smaller datasets instead of huge ones, because smaller datasets require less memory and less processing time.

§ If you want to look at the data at a very high level rather than at a granular level.

Practical Use:

We have daily sales data pertaining to 1,000 stores spread across the US. A minimal sketch of these aggregations is shown after this list.

o We could take Store 1 and aggregate its daily sales into monthly sales.

o Similarly, we can aggregate the stores based on their geographical locations.

o Aggregate sales based on the type of product.

o Aggregate based on the day of the week, and so on.
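As a rough sketch of these aggregations, assume a hypothetical pandas DataFrame daily_sales with store_id, date, region, product_type and sales columns (the data and column names are invented for illustration); groupby then rolls the daily records up to coarser levels:

```python
import pandas as pd

# Hypothetical daily sales table: one row per store per day (illustrative values only).
daily_sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2022-01-03", "2022-02-10", "2022-01-04", "2022-01-05"]),
    "region": ["East", "East", "West", "West"],
    "product_type": ["Grocery", "Grocery", "Apparel", "Apparel"],
    "sales": [1200.0, 950.0, 430.0, 610.0],
})

# Store 1: aggregate daily sales into monthly sales.
store1 = daily_sales[daily_sales["store_id"] == 1]
monthly = store1.groupby(store1["date"].dt.to_period("M"))["sales"].sum()

# Aggregate by geographical region, by product type, and by day of the week.
by_region = daily_sales.groupby("region")["sales"].sum()
by_product = daily_sales.groupby("product_type")["sales"].sum()
by_weekday = daily_sales.groupby(daily_sales["date"].dt.day_name())["sales"].sum()

print(monthly, by_region, by_product, by_weekday, sep="\n\n")
```

Each aggregated table is much smaller than the daily data, which is exactly the memory and processing benefit described above.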

2. Sampling:

Sampling is an approach for selecting, from a larger data set, a subset of the data that can be analysed. Usually sampling is used as part of a preliminary investigation of the data and in the final data analysis.

Be conscious of the sample data that you are going to use. A good sample is representative, i.e. the sample dataset has approximately the same properties or behaviour as the original set of data.

There are various sampling approaches like:

· Simple random sampling

· Stratified sampling

· Progressive sampling

· Cluster sampling etc.

Uses of Sampling:

§ Sampling can be used when it is too expensive or time-consuming to obtain the whole data set.

§ Analysis of the sample data is less tedious and more convenient than an analysis of the entire data set.

Practical Use:

Before a project kicks off, we are usually given sample data so that we can go through it and get an understanding of the data with respect to the problem we are trying to solve. The assumption here is that the sample is extracted from a larger population with similar properties. If it is not, the sampling adds little value. Hence, make sure to cross-check whether the sample data we receive is representative.
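As a minimal sketch of simple random and stratified sampling, assume a hypothetical population of 100,000 transactions with a rare "fraud" label (the data and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical population: 100,000 transactions, 1 in 50 labelled as fraud.
population = pd.DataFrame({
    "amount": range(100_000),
    "label": ["fraud" if i % 50 == 0 else "normal" for i in range(100_000)],
})

# Simple random sampling: every record has the same chance of being selected.
simple_sample = population.sample(frac=0.01, random_state=42)

# Stratified sampling: sample 1% within each label so the rare class stays represented.
stratified_sample = population.groupby("label").sample(frac=0.01, random_state=42)

# Quick representativeness check: compare class proportions with the population.
print(population["label"].value_counts(normalize=True))
print(simple_sample["label"].value_counts(normalize=True))
print(stratified_sample["label"].value_counts(normalize=True))
```

Comparing the class proportions of the sample with those of the population is one simple way to cross-check that the sample is representative.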

3. Dimensionality Reduction:

Dimensionality reduction is one of the important pre-processing techniques and is used when we have a large number of features in a data set. The terms that we usually hear along with dimensionality reduction are:

“Curse of dimensionality”: this refers to the phenomenon wherein the analysis of the data becomes significantly more difficult as the dimensionality of the data increases.

“Principal Component Analysis (PCA)”: a linear algebra technique for continuous attributes that finds new attributes (principal components) which are linear combinations of the original attributes, are orthogonal (perpendicular to each other) and capture the maximum amount of variation in the data.

Uses:

§ Many algorithms work better if the dimensionality, i.e. the number of attributes in the data set, is lower.

§ By applying dimensionality reduction we can eliminate irrelevant features or noise.

§ This technique can lead to a more understandable model because the model now involves fewer attributes.

Practical Use:

o Usually when we get a complete dataset, it has a large number of features; many of them might be useful and many might not be. After analysing the redundant features, we can remove those that are deemed unnecessary.

o Many times you might receive the data set in the form of documents, where each document is represented by a vector holding the frequency of each word that occurs in the document. In such cases the documents typically end up having thousands or tens of thousands of features, one for each word in the vocabulary. We need to examine these document vectors and work out how their dimensionality can be reduced. A small PCA sketch is shown below.
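As a minimal PCA sketch with scikit-learn, assume a made-up numeric matrix of 50 correlated features driven by a handful of underlying signals (for sparse document-term matrices, TruncatedSVD is the usual alternative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 500 samples, 50 observed features driven by 5 hidden signals.
rng = np.random.default_rng(0)
signals = rng.normal(size=(500, 5))
X = signals @ rng.normal(size=(5, 50)) + rng.normal(scale=0.1, size=(500, 50))

# PCA finds orthogonal linear combinations of the original attributes
# that capture the maximum amount of variation in the data.
pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The reduced matrix has far fewer columns while retaining most of the variation, which is the benefit described in the Uses section above.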

4. Feature Creation:

Feature creation is an aspect wherein, from the original attributes, we create a new set of attributes that capture the important information in the data.

The various ways of creating new attributes are:

Feature Extraction: As the term suggests, this is the creation of new features from the original data.

Mapping data to a new Space: A completely different view of the data can help us reveal important and very interesting features.

Feature Construction: Sometimes the features in the original dataset hold good information but are not in a form suitable for us or for the algorithm. In such scenarios, new features are constructed out of the original features, and these can give us more insight than the original features.

Uses:

§ The number of new attributes that are created can be smaller than the number of original attributes, so we gain all the previously described benefits of dimensionality reduction.

§ If a new feature is created from the original features, we may be able to analyse it better than the original set of features.

Practical Use:

o For example, let’s consider a set of images, where each image is to be classified according to whether it contains an animal face or not. The raw data is nothing but a set of pixels and as such is not suitable for many types of classification algorithms. But if the data is processed to provide higher-level features, say the absence or presence of certain edges and areas that are highly correlated with the presence of an animal face, then a much broader set of classification techniques can be applied.

o As an example of mapping the data to a new space, consider time series data, which often contains periodic patterns. If there is only a single periodic pattern and not much noise, the pattern is easily detected. If, on the other hand, there are a number of periodic patterns and a significant amount of noise, then these patterns are hard to detect. Such patterns are usually detected by applying a “Fourier transform” to the time series in order to change to a representation in which the frequency information is explicit. So basically, for the time series, the Fourier transform produces a new data object whose attributes are related to frequencies. A small sketch is shown below.
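As a minimal sketch of this mapping, take a made-up noisy series containing 7 Hz and 13 Hz components (the frequencies, sampling rate and noise level are all invented for illustration); NumPy’s FFT makes the frequency information explicit:

```python
import numpy as np

# Hypothetical noisy time series: two periodic patterns plus noise, 1 second at 1000 Hz.
t = np.linspace(0, 1, 1000, endpoint=False)
series = (np.sin(2 * np.pi * 7 * t)                                   # 7 Hz pattern
          + 0.5 * np.sin(2 * np.pi * 13 * t)                          # 13 Hz pattern
          + np.random.default_rng(0).normal(scale=0.8, size=t.size))  # noise

# Map the series to frequency space: the attributes of the new object are frequencies.
spectrum = np.abs(np.fft.rfft(series))
freqs = np.fft.rfftfreq(series.size, d=t[1] - t[0])

# The dominant peaks reveal the hidden periodic patterns despite the noise.
top_two = freqs[np.argsort(spectrum)[-2:]]
print("dominant frequencies (Hz):", np.sort(top_two))
```

In the time domain the two patterns are buried in noise; in the frequency representation they show up as two clear peaks.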

5. Discretization and Binarization

Sometimes it becomes necessary for us to transform a continuous attribute into a categorical attribute (Discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (Binarization).

Uses :

§ Many data mining algorithms, especially certain classification algorithms, require that the data be in the form of categorical attributes, and algorithms that find association patterns require that the data be in the form of binary attributes. This is where Discretization and Binarization become important for us.

Practical Use:

o A housing data set has information regarding the size of each house.

o We can use discretization to convert those values and bucket them (based on size) as Small, Medium, Large and Very Large houses, and binarization to turn the buckets into binary attributes. A small sketch follows this list.

o We can similarly use discretization and binarization to bucket transactions based on the day of the week or the month of the year, and so on.
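As a minimal sketch of this bucketing, assume a hypothetical size_sqft column and made-up bin boundaries; pd.cut handles the discretization and pd.get_dummies the binarization:

```python
import pandas as pd

# Hypothetical housing data with a continuous size attribute (square feet).
houses = pd.DataFrame({"size_sqft": [650, 1100, 1850, 2600, 4200, 900]})

# Discretization: bucket the continuous size into ordered categories.
houses["size_bucket"] = pd.cut(
    houses["size_sqft"],
    bins=[0, 1000, 2000, 3000, float("inf")],   # invented cut points
    labels=["Small", "Medium", "Large", "Very Large"],
)

# Binarization: one binary attribute per bucket (useful for association analysis).
size_binary = pd.get_dummies(houses["size_bucket"], prefix="size")

print(houses, size_binary, sep="\n\n")
```

The same pattern applies to bucketing transactions by day of the week or month of the year.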

6. Variable Transformation:

A variable transformation refers to a transformation that is applied to all the values of a feature. For example, if only the magnitude of the feature is important, then the values of the feature can be transformed by taking the absolute value.

Please note: Variable transformations should be applied with caution since they change the nature of the data.

We have two important types of variable transformations:

Simple functional transformation and

Normalization.

Uses :

§ Variable transformation often becomes imperative when we want to obtain a representative feature for our analysis.

§ It can also be used to bring a feature’s distribution closer to a normal distribution.

Practical Use:

o Suppose we have a feature that consists of the number of data bytes in a session, and this number ranges from one to one billion. This is a huge range, and it may be advantageous to compress it by using a log10 transformation. Various similar transformations, such as the absolute value, √x or x^k, can be applied depending on the scenario. A short sketch is shown below.
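As a minimal sketch of such transformations, assume a hypothetical bytes column (values invented for illustration); log10 compresses the range, and standardization is one common form of normalization:

```python
import numpy as np
import pandas as pd

# Hypothetical feature: data bytes per session, spanning one to one billion.
sessions = pd.DataFrame({"bytes": [1, 3_400, 560_000, 78_000_000, 1_000_000_000]})

# Simple functional transformations: log10 compresses the huge range; sqrt is another option.
sessions["log10_bytes"] = np.log10(sessions["bytes"])
sessions["sqrt_bytes"] = np.sqrt(sessions["bytes"])

# Normalization: standardize to zero mean and unit standard deviation.
sessions["bytes_standardized"] = (
    sessions["bytes"] - sessions["bytes"].mean()
) / sessions["bytes"].std()

print(sessions)
```

Note that, as cautioned above, these transformations change the nature of the data, so the choice of transformation should match what the downstream analysis needs.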

=========== Thank you! ====================

Reference:

Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach & Vipin Kumar.
