Data Cleaning
Categories: IBM Machine Learning
Updated:
Why is it important?
Key aspects of machine learning depend on clean data, for example:
- Observation
- Labels: predicted
- Algorithms: estimation
- Features
- Model: assumes the data is an actual representation of the phenomenon
Messy data leads to garbage in, garbage out.
Reasons: too little data, too much data, or bad data
How to deal with it?
Duplicate or unnecessary data
Filter the data as necessary, keeping only the rows and fields the analysis needs.
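The filtering step above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the records, the field names, and the helper functions `drop_duplicates` and `keep_columns` are all hypothetical.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "city": "Seoul", "price": 120.0},
    {"id": 2, "city": "Busan", "price": 95.0},
    {"id": 1, "city": "Seoul", "price": 120.0},  # exact duplicate of the first row
    {"id": 3, "city": "Seoul", "price": 130.0},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def keep_columns(rows, columns):
    """Drop fields that are not needed for the analysis."""
    return [{c: row[c] for c in columns} for row in rows]


deduped = drop_duplicates(records)
cleaned = keep_columns(deduped, ["city", "price"])
```

Here duplicates are detected on the full row. If only some fields identify an observation, the fingerprint would be built from those fields instead.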
Inconsistent text and typos
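One possible way to handle inconsistent text is to normalize whitespace and case, then map known misspellings to a canonical label. The sketch below assumes a small hand-maintained mapping. The labels, the `CANONICAL` and `KNOWN_VARIANTS` tables, and the `normalize` helper are hypothetical.

```python
# Hypothetical raw labels; the mappings below are assumptions for illustration.
raw_cities = ["  Seoul", "seoul ", "SEOUL", "Busan", "busn", "Pusan"]

CANONICAL = {"seoul": "Seoul", "busan": "Busan"}
KNOWN_VARIANTS = {"busn": "busan", "pusan": "busan"}  # hand-maintained typo/alias map


def normalize(label):
    """Trim, collapse whitespace, ignore case, then map to a canonical label."""
    key = " ".join(label.split()).casefold()
    key = KNOWN_VARIANTS.get(key, key)
    # Unknown labels are left as they are, apart from trimming.
    return CANONICAL.get(key, label.strip())


cities = [normalize(c) for c in raw_cities]
```

Leaving unknown labels untouched keeps the step conservative: nothing is silently merged unless the mapping says so.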
Missing data
- Remove the data: simple, but it is easy to lose a lot of data this way
- Impute the data
- Mask the data: create a category for missing values
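The three options above can be compared side by side on a toy example. This is a sketch under stated assumptions: the columns are hypothetical, missing values are represented as `None`, and the median is just one of several reasonable imputation choices.

```python
from statistics import median

# Hypothetical columns with missing entries represented as None.
ages = [34, None, 29, 41, None, 38]
colors = ["red", None, "blue", "red", "blue", None]

# 1) Remove: drop every row that has a missing value (this shrinks the dataset).
rows = list(zip(ages, colors))
complete_rows = [(a, c) for a, c in rows if a is not None and c is not None]

# 2) Impute: replace missing numeric values with the median of the observed ones.
age_median = median(a for a in ages if a is not None)
ages_imputed = [age_median if a is None else a for a in ages]

# 3) Mask: treat "missing" as its own category.
colors_masked = ["missing" if c is None else c for c in colors]
```

In this example removal keeps only three of the six rows, which illustrates how quickly that option discards data.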
Outliers
Outlier: an observation that is distant from most other observations. An aberration that does not represent the phenomenon we are trying to explain.
How to find outliers?
- Plots: Histogram, Box plot
- Statistics: Interquartile range
- Residuals:
  - Standardized: residual divided by its standard error
  - Deleted: residual from fitting the model on all data excluding the current observation
  - Studentized: deleted residual divided by the standard error of the residuals
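The interquartile-range check listed above can be sketched as follows. The multiplier `k=1.5` is a common convention rather than something these notes specify, and the function name `iqr_outliers` and the sample data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [10, 12, 11, 13, 12, 14, 11, 95]  # hypothetical sample with one extreme value
outliers = iqr_outliers(data)
```

A box plot draws essentially the same fences, so the values flagged here are the ones that would appear as isolated points on such a plot.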
Internal and External Studentized Residuals
The usual estimate of $\sigma^2$, used in the internally studentized residual, is
$$\hat{\sigma}^2 = \frac{1}{n-m} \sum_{j=1}^{n} \hat{\varepsilon}_j^{\,2}$$
where $m$ is the number of parameters in the model.
But if the $i$th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the $i$th observation from the process of estimating the variance when one is considering whether the $i$th case may be an outlier, and instead use the estimate behind the externally studentized residual, which is
$$\hat{\sigma}_{(i)}^2 = \frac{1}{n-m-1} \sum_{j=1,\, j \ne i}^{n} \hat{\varepsilon}_j^{\,2}$$
based on all the residuals except the suspect $i$th residual. Note that the $\hat{\varepsilon}_j^{\,2}$ ($j \ne i$) for suspect $i$ are computed with the $i$th case excluded.
If the estimate $\hat{\sigma}^2$ includes the $i$th case, the result is called the internally studentized residual, $t_i$ (also known as the standardized residual). If the estimate $\hat{\sigma}_{(i)}^2$ is used instead, excluding the $i$th case, the result is called the externally studentized residual, $t_{i(\text{ext})}$.
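To make the distinction concrete, here is a minimal sketch for simple linear regression (so the number of parameters is 2). It assumes the standard definition in which a residual is divided by the estimated standard deviation times the square root of one minus the leverage, a detail these notes do not spell out. The data and the function names are hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b


def rss(x, y, a, b):
    """Residual sum of squares of the line a + b*x on the given points."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def studentized_residuals(x, y):
    """Return (internal, external) studentized residuals for a straight-line fit."""
    n, m = len(x), 2  # m = number of model parameters (intercept and slope)
    a, b = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sigma2 = rss(x, y, a, b) / (n - m)  # variance estimate including every case

    internal, external = [], []
    for i in range(n):
        e_i = y[i] - (a + b * x[i])              # residual from the full fit
        h_ii = 1 / n + (x[i] - x_bar) ** 2 / sxx  # leverage of case i
        internal.append(e_i / sqrt(sigma2 * (1 - h_ii)))

        # Refit without case i and estimate the variance from the remaining cases.
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a_i, b_i = fit_line(x_rest, y_rest)
        sigma2_i = rss(x_rest, y_rest, a_i, b_i) / (n - m - 1)
        external.append(e_i / sqrt(sigma2_i * (1 - h_ii)))
    return internal, external


# Hypothetical data: roughly linear, with one suspicious value at index 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.6, 6.5, 7.4, 10.4, 18.0, 13.7, 16.2]
t_int, t_ext = studentized_residuals(x, y)
```

Because the suspect point inflates the variance estimate that includes it, its externally studentized residual comes out larger in magnitude than its internally studentized one, which is why the external version is the more sensitive outlier check.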
Policies for outliers
- Remove them
- Assign the mean or median value
- Transform the variable
- Predict the value, using similar observations or regression
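The first three policies above can be sketched on a toy sample. The data and the threshold used to flag the outlier are placeholders, and the log transform is only one example of a transformation that compresses large values.

```python
from math import log1p
from statistics import median

values = [10, 12, 11, 13, 12, 14, 11, 95]  # hypothetical sample
is_outlier = [v > 50 for v in values]       # placeholder rule, an assumption for this sketch

# Remove: drop the flagged observations.
removed = [v for v, flag in zip(values, is_outlier) if not flag]

# Assign: replace flagged observations with the median of the remaining ones.
center = median(removed)
replaced = [center if flag else v for v, flag in zip(values, is_outlier)]

# Transform: a log transform pulls extreme values closer to the rest.
transformed = [log1p(v) for v in values]
```

Removal changes the sample size, assignment keeps it but shrinks the variance, and transformation keeps every observation while reducing the influence of the extreme one.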