Data Cleaning

Why is it important?

Key aspects of ML depend on clean data, e.g.:

  • Observations
  • Labels: the values being predicted
  • Algorithms: estimation
  • Features
  • Model: assumed to represent the actual data

Messy data leads to garbage in, garbage out.

Reasons: lack of data, too much data, bad data

How to deal with it?

Duplicate or unnecessary data

Filter the data as necessary.
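
A minimal sketch of filtering out duplicate rows and unneeded columns, using only the standard library. The records, the field names (`id`, `city`, `debug_flag`) and the helper names are hypothetical and only serve the illustration.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def keep_columns(rows, wanted):
    """Drop every field that is not listed in wanted."""
    return [{f: row[f] for f in wanted} for row in rows]


# Hypothetical records: the third row repeats the first one.
records = [
    {"id": 1, "city": "Austin", "debug_flag": 0},
    {"id": 2, "city": "Boston", "debug_flag": 1},
    {"id": 1, "city": "Austin", "debug_flag": 0},
]

deduped = drop_duplicates(records, key_fields=("id", "city"))
cleaned = keep_columns(deduped, wanted=("id", "city"))
print(cleaned)
```

Which fields define a duplicate is a judgment call that depends on the dataset; here two rows count as duplicates when both `id` and `city` match.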

Inconsistent text and typos
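
One possible way to handle this, sketched with the standard library: normalize case and whitespace first, then map near-misses onto a list of known-good spellings with `difflib.get_close_matches`. The vocabulary, the sample values and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical list of accepted spellings.
CANONICAL = ["new york", "boston", "chicago"]


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.split()).lower()


def fix_typo(text, vocabulary=CANONICAL, cutoff=0.8):
    """Return the closest accepted spelling, or the normalized text if none is close."""
    cleaned = normalize(text)
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


raw = ["  New York", "new  york", "Bostn", "CHICAGO ", "Denver"]
fixed = [fix_typo(value) for value in raw]
print(fixed)
```

Values with no close match (here "Denver") are left as normalized text rather than forced into the vocabulary, so they can be reviewed by hand.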

Missing data

  1. Remove the data, but this can easily lose a lot of data
  2. Impute the data
  3. Mask the data: create a category for missing values
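
The three options above can be sketched as follows. The sample columns are made up, `None` stands in for a missing value, and imputing with the median is just one common choice among several.

```python
from statistics import median

# Hypothetical columns; None marks a missing value.
ages = [34, None, 29, 41, None, 38]
colors = ["red", None, "blue", "red", None, "green"]

# 1. Remove: drop the missing entries (two of six values are lost here).
removed = [a for a in ages if a is not None]

# 2. Impute: fill the gaps with the median of the observed values.
fill_value = median(removed)
imputed = [fill_value if a is None else a for a in ages]

# 3. Mask: treat "missing" as its own category.
masked = ["missing" if c is None else c for c in colors]

print(removed)
print(imputed)
print(masked)
```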


Outlier: an observation that is distant from most other observations; an aberration that does not represent the phenomenon we are trying to explain.

How to find outliers?

  1. Plots: Histogram, Box plot
  2. Statistics: Interquartile range
  3. Residuals:
    1. Standardized residuals: residual divided by its standard error
    2. Deleted: residual from fitting the model on all data excluding the current observation
    3. Studentized: deleted residuals divided by the standard error of the residuals
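
As an illustration of the interquartile-range approach, the sketch below flags values lying more than k times the IQR outside the quartiles. The multiplier k = 1.5 is the conventional rule of thumb and is an assumption here, as are the sample data and the choice of the "inclusive" quantile method.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one obviously distant value.
data = [10, 12, 11, 13, 12, 11, 14, 95]
print(iqr_outliers(data))
```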

Internal and External Studentized Residuals

The usual estimate of $\sigma^2$, used for the internally studentized residual, is

$$\widehat{\sigma}^2 = \frac{1}{n-m} \sum_{j=1}^{n} \widehat{\varepsilon}_j^{\,2},$$

where m is the number of parameters in the model.

But if the i-th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the i-th observation from the process of estimating the variance when one is considering whether the i-th case may be an outlier, and instead use the externally studentized residual, which uses the estimate

$$\widehat{\sigma}_{(i)}^2 = \frac{1}{n-m-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} \widehat{\varepsilon}_j^{\,2},$$

based on all the residuals except the suspect i-th residual. Note that the residuals $\widehat{\varepsilon}_j$ ($j \neq i$) for a suspect case i are computed with the i-th case excluded.

If the estimate $\widehat{\sigma}^2$, which includes the i-th case, is used, the result is called the internally studentized residual, $t_i$ (also known as the standardized residual). If the estimate $\widehat{\sigma}_{(i)}^2$, which excludes the i-th case, is used instead, the result is called the externally studentized residual, $t_{i(i)}$.
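
A rough sketch of both variants for a simple linear regression y = a + b*x (so m = 2), in plain Python. Two things here go beyond the notes above and should be read as assumptions: each residual is divided by sqrt(1 - h_ii), where h_ii is the leverage of case i (the standard definition of a studentized residual), and the data are invented. The external variance estimate is obtained by refitting without case i.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def studentized_residuals(xs, ys):
    """Return (internal, external) studentized residuals for a line fit."""
    n, m = len(xs), 2  # m = number of parameters (intercept and slope)
    a, b = fit_line(xs, ys)
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Leverage h_ii for simple linear regression (assumption, see lead-in).
    leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    sigma2 = sum(e * e for e in resid) / (n - m)

    internal, external = [], []
    for i in range(n):
        # Refit without case i and sum the squared residuals of that fit.
        xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a_i, b_i = fit_line(xs_i, ys_i)
        sse_i = sum((y - (a_i + b_i * x)) ** 2 for x, y in zip(xs_i, ys_i))
        sigma2_i = sse_i / (n - m - 1)
        scale = sqrt(1 - leverage[i])
        internal.append(resid[i] / (sqrt(sigma2) * scale))
        external.append(resid[i] / (sqrt(sigma2_i) * scale))
    return internal, external


# Hypothetical data: roughly y = 2x, with the fourth point pushed far off the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.3, 3.6, 6.4, 15.0, 9.7, 12.4, 13.6, 16.3]
t_int, t_ext = studentized_residuals(xs, ys)
for x, ti, te in zip(xs, t_int, t_ext):
    print(f"x={x}: internal={ti:.2f}, external={te:.2f}")
```

Because the suspect case inflates the pooled variance estimate, its externally studentized residual comes out larger in magnitude than the internal one, which is what makes the external version more sensitive to a single outlier.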

Policies for outliers

  1. Remove
  2. Assign the mean or median
  3. Transform
  4. Predict: using similar observations, or regression
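
The first three policies can be sketched as below; prediction from similar observations or a regression is not shown. The data, the threshold of 50 used to flag the outlier, and the choice of a log transform are all assumptions made for the example.

```python
from math import log1p
from statistics import median

# Hypothetical sample; in practice the flags might come from an IQR or residual check.
values = [10, 12, 11, 13, 12, 11, 14, 95]
is_outlier = [v > 50 for v in values]  # hypothetical threshold

# 1. Remove the flagged observations.
removed = [v for v, flag in zip(values, is_outlier) if not flag]

# 2. Assign the median of the remaining observations to the flagged ones.
center = median(removed)
replaced = [center if flag else v for v, flag in zip(values, is_outlier)]

# 3. Transform: log(1 + v) compresses large values and keeps every observation.
transformed = [log1p(v) for v in values]

print(removed)
print(replaced)
print([round(t, 2) for t in transformed])
```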

