Unbalanced Classes
Tags: Classification, IBM, Supervised, Week4
Categories: IBM Machine Learning
Updated:
Unbalanced classes
Classifiers are built to optimize accuracy, and hence will often perform poorly on unbalanced or under-represented classes.
- Downsampling: delete data from the majority class in the training set to balance the classes.
- Upsampling: duplicate data from the minority class.
- Resampling: limit the larger class by downsampling and grow the smaller class by upsampling.
- Weighting: the general approach in sklearn models. Adjust the sample weights so that the total weight is equal across classes. The advantage is that no data has to be sacrificed.
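A minimal sketch of class weighting with scikit-learn; the toy dataset from make_classification is hypothetical, and class_weight='balanced' reweights samples so each class contributes the same total weight:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data with a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' sets each class weight to n_samples / (n_classes * n_class_samples),
# so the total weight contributed by each class is equal.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# The same weights can also be computed explicitly.
print(compute_class_weight("balanced", classes=np.unique(y), y=y))
```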
Problems
For every minority-class point correctly identified as such, we may also wrongly label a few majority-class points as minority. As a result, recall goes up while precision likely goes down.
- Downsampling adds tremendous importance to the minority class.
- Upsampling mitigates some of that excessive weight on the minority class: recall is still higher than precision, but the gap is smaller than with downsampling.
In short, undersampling will probably lead to a somewhat higher recall for the minority class at the cost of precision, whereas oversampling keeps all of the majority-class values and will therefore have a bit lower recall than undersampling, but better precision on those predictions.
Cross-validation
We can use the ROC curve and AUC (Area Under the Curve) to evaluate the result of each sampling method. F1 and Cohen's Kappa are also useful for further study.
Cohen’s Kappa
Cohen's Kappa is a measure of agreement between two raters (or two models), each of which classifies n items into mutually exclusive categories. The goal is a ratio of the observed agreement between the two models to the agreement expected purely by chance. With unbalanced classes, chance agreement on the dominant class is already high, so we want the agreement between the two models to be clearly better than chance. The higher the Kappa value, the more we can trust the agreed-upon predictions of the two models.
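A minimal sketch, assuming scikit-learn, of computing these metrics; the y_test / y_pred / y_proba values below are made-up placeholders for whatever model and sampling method is being evaluated:

```python
from sklearn.metrics import cohen_kappa_score, f1_score, roc_auc_score

# Placeholder labels, hard predictions, and positive-class scores.
y_test  = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0]
y_proba = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.45]

print("AUC   :", roc_auc_score(y_test, y_proba))
print("F1    :", f1_score(y_test, y_pred))
# Kappa here measures agreement between predictions and true labels,
# corrected for the agreement expected by chance.
print("Kappa :", cohen_kappa_score(y_test, y_pred))
```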
Stratified Sampling
- stratify option for train_test_split
- KFold -> StratifiedKFold -> RepeatedStratifiedKFold
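A minimal sketch of both options with scikit-learn, on a hypothetical imbalanced toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (RepeatedStratifiedKFold, StratifiedKFold,
                                     train_test_split)

# Hypothetical toy data with a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the 9:1 class ratio in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Stratified cross-validation: every fold preserves the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    pass  # fit and evaluate on each stratified fold here
```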
Oversampling
The simplest way is Random Oversampling: resample with replacement from the minority class, with no concern for the geometry of the feature space.
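A minimal sketch of random oversampling, assuming the imbalanced-learn package is available; the toy dataset is hypothetical:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Randomly duplicate minority samples (with replacement) until the classes balance.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```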
Another approach is Synthetic Oversampling: start with a point in the minority class, choose one of its K nearest neighbors, and create a new point at a random location between the two data points.
SMOTE
Synthetic Minority Oversampling Technique (SMOTE) is a method for resampling the minority class to create synthetic samples.
- Regular: connect minority class points to any neighbor.
- Borderline: first classify minority points as outlier, safe, or in-danger.
  - Borderline-1: connect minority in-danger points only to minority points.
  - Borderline-2: connect minority in-danger points to whatever is nearby.
- SVM: use minority support vectors to generate new points.
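A minimal sketch of the three SMOTE variants, assuming imbalanced-learn; the toy dataset is hypothetical:

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Regular SMOTE: interpolate between a minority point and one of its k nearest minority neighbors.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Borderline SMOTE: only "in danger" minority points near the class boundary are used.
X_bl, y_bl = BorderlineSMOTE(random_state=0).fit_resample(X, y)

# SVM SMOTE: the minority support vectors of an SVM drive where new points are generated.
X_sv, y_sv = SVMSMOTE(random_state=0).fit_resample(X, y)
```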
ADASYN
Adaptive Synthetic Sampling of the minority class (ADASYN) generates new samples for each minority point in proportion to the number of competing (majority-class) neighbors around it, so synthesis is concentrated in the harder regions.
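A minimal ADASYN sketch under the same assumptions (imbalanced-learn, hypothetical toy data):

```python
from collections import Counter
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# More synthetic points are generated for minority samples surrounded by many
# majority-class neighbors, i.e. the harder-to-learn regions.
X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```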
Undersampling -> Study again!!
NearMiss
Choose the majority points that are nearest to the minority points. The purpose is to keep the points that define the decision boundary.
- NearMiss-1: keep the majority points with the smallest average distance to their k nearest minority points.
- NearMiss-2: keep the majority points with the smallest average distance to their k farthest minority points.
- NearMiss-3: two steps. 1) For each minority sample, keep its k nearest majority neighbors. 2) Among those, keep the majority points with the largest average distance to the minority class.
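A minimal NearMiss sketch, assuming imbalanced-learn; the version parameter selects which of the three heuristics above is used:

```python
from collections import Counter
from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Undersample the majority class with each NearMiss heuristic.
for version in (1, 2, 3):
    X_res, y_res = NearMiss(version=version).fit_resample(X, y)
    print(f"NearMiss-{version}:", Counter(y_res))
```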
Tomek Links
A Tomek link is a pair of points from opposite classes that are each other's nearest neighbors; removing the majority point of each link (or both points) cleans up the class boundary.
Edited Nearest Neighbors
This removes points whose class does not agree with the majority of their nearest neighbors.
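A minimal sketch of both cleaning methods, assuming imbalanced-learn and a hypothetical toy dataset:

```python
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours, TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Tomek links: drop majority points that form cross-class nearest-neighbor pairs.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# ENN: drop points whose class disagrees with the majority of their k nearest neighbors.
X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
print(Counter(y), Counter(y_tl), Counter(y_enn))
```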
Blagging
Blagging is balanced bagging: while bootstrapping, each bootstrap sample is balanced by downsampling the majority class, and the final prediction is a majority vote.
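A minimal sketch using imbalanced-learn's BalancedBaggingClassifier, which implements this idea; decision trees as base estimators and F1 scoring are arbitrary choices here:

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Each bootstrap sample is rebalanced by random undersampling of the majority class;
# the ensemble prediction is a majority vote over the fitted trees.
clf = BalancedBaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
print(cross_val_score(clf, X, y, scoring="f1", cv=5).mean())
```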