Data Cleaning

Tags: EDA, Week2

Categories: IBM Machine Learning

Updated: December 1, 2022

Data Cleaning

Why important?

Key aspects of ML depend on clened data e.g.)

Observation
Labels: predicted
Algorithms: estimation
Features
Model: assume this is acutal data representent

Messy data generate garbage-in, garbage-out

Reason: Lack of data, too much data, bad data

How to deal with it?

Duplicate or unnecessary data

filter the data as necessary

Inconsistent text and typos

Missing data

Remove the data\ but easily lose a lot of data
Imput the data
Mask the data: create a category for missing values

Outliers

Outlier: observation in data that is distant from most other observations Aberration that are not representing the phenomenon we are trying to explain

how to find outliers?

Plots: Histogram, Box plot
Statistics: Interquartile range
Residuals:
1. Standardized residuals: residual divided by stnd error
2. Deleted: residual from fitting model on all data excluding current observation
3. Studentized: Deleted residuals divided by standard error of the residuals

Internal and External Studentized Residuals

The usual estimate of $σ^2$ is the internally studentized residual

{\displaystyle {\widehat {\sigma }}^{2}={1 \over n-m}\sum _{j=1}^{n}{\widehat {\varepsilon \,}}_{j}^{\,2}.}

where m is the number of parameters in the model.

But if the i th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the i th observation from the process of estimating the variance when one is considering whether the i th case may be an outlier, and instead use the externally studentized residual, which is

{\displaystyle {\widehat {\sigma }}_{(i)}^{2}={1 \over n-m-1}\sum _{\begin{smallmatrix}j=1\\j\neq i\end{smallmatrix}}^{n}{\widehat {\varepsilon \,}}_{j}^{\,2},}

based on all the residuals except the suspect i th residual. Here is to emphasize that ${\displaystyle {\widehat {\varepsilon \,}}_{j}^{\,2}(j\neq i)}$ for suspect i are computed with i th case excluded.

If the estimate σ2 includes the i th case, then it is called the internally studentized residual, $t_{i}$ (also known as the standardized residual). If the estimate $\widehat {\sigma }_{(i)}^{2}$ is used instead, excluding the i th case, then it is called the externally studentized, $t_{i(i)}$ .

Policies for outliers

remove
assign mean or median
Transform
predict: using similar obeservation, or regression

Share on

Twitter Facebook LinkedIn

Younghun Lee

Data Cleaning

Data Cleaning

Why important?

How to deal with it?

Duplicate or unnecessary data

Inconsistent text and typos

Missing data

Outliers

how to find outliers?

Internal and External Studentized Residuals

Policies for outliers

Share on

Leave a comment

You may also enjoy

Policy Gradient Method

Machine Learning on Apple stock daily return2

Machine Learning on Apple stock daily return

Reinforcement Learning