
Unbalanced classes

Classifiers are built to optimize accuracy, and hence they often perform poorly on unbalanced or under-represented classes.

  • Downsampling: delete data from the majority class in the training set to balance the classes.
  • Upsampling: duplicate data from the minority class.
  • Resampling: limit the larger class by downsampling and grow the smaller class by upsampling.
  • Weighting: the general approach available in sklearn models. It adjusts the weights so that the total weight is equal across classes; the advantage is that no data is sacrificed (see the sketch below).
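
Below is a minimal sketch of the weighting approach using scikit-learn's `class_weight` option; the dataset is synthetic and the 9:1 imbalance is only for illustration.

```python
# Minimal sketch: class weighting in scikit-learn (no data is discarded).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 9:1 imbalanced dataset, for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency,
# so the total weight is equal across classes.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)
```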

Problems

For every minority-class point correctly identified as such, we might wrongly label a few majority-class points as minority-class. As a result, recall goes up while precision likely goes down.

  • Downsampling adds tremendous importance to the minority class.
  • Upsampling mitigates some of that excessive weight on the minority class: recall is still higher than precision, but the gap is smaller than with downsampling.

Undersampling will likely lead to somewhat higher recall for the minority class at the cost of precision, whereas oversampling keeps all the values from the majority class and thus gives slightly lower recall than undersampling, but better precision on those predictions.

Cross-validation

We can use the ROC curve and AUC (Area Under the Curve) to evaluate the result of each sampling method, and also F1 and Cohen’s Kappa for further study.
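
A rough sketch of scoring one sampling strategy, assuming `X_train`, `X_test`, `y_train`, `y_test` already exist (e.g. from a stratified split) and using a random forest purely as a placeholder model:

```python
# Sketch: scoring one sampling strategy with ROC AUC and F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ROC AUC uses predicted probabilities; F1 uses hard labels.
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
print("F1     :", f1_score(y_test, clf.predict(X_test)))
```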

Cohen’s Kappa

Cohen’s Kappa is a measure of agreement between two different raters, or two different models, where each rater classifies n items into mutually exclusive categories. The goal is to compare the observed agreement between the two models with the probability of agreement occurring just by chance: kappa = (observed agreement - chance agreement) / (1 - chance agreement). With unbalanced classes, we want the agreement between the two models to be stronger than chance agreement, so the higher Cohen’s Kappa is, the more we can trust their agreed-upon predictions.
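
In code this can be computed with scikit-learn's `cohen_kappa_score`; here `model_a` and `model_b` are assumed to be two already-fitted classifiers and `X_test` a held-out set.

```python
# Sketch: Cohen's Kappa as agreement between two models' predictions.
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(model_a.predict(X_test), model_b.predict(X_test))
print("Cohen's Kappa:", kappa)  # 1.0 = perfect agreement, 0.0 = chance level
```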

Stratified Sampling

  • Stratify option for train_test_split
  • KFold -> StratifiedKFold -> RepeatedStratifiedKFold (see the sketch below)
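
A minimal sketch, assuming `X` and `y` are the features and labels:

```python
# Sketch: stratified splitting keeps the class ratio in every split/fold.
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     RepeatedStratifiedKFold)

# stratify=y preserves the class proportions in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # each fold keeps roughly the same class ratio
```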

Oversampling

The simplest way is Random Oversampling: resample with replacement from the minority class, with no concern for the geometry of the feature space.
Another approach is Synthetic Oversampling: start with a point in the minority class, choose one of its K nearest neighbors, and create a new point at a random position between the two data points.
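
A sketch of random oversampling with imbalanced-learn, assuming `X`, `y` are the (imbalanced) training data:

```python
# Sketch: random oversampling (resample minority with replacement).
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # classes are now balanced
```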

SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is a method for resampling the minority class to create synthetic samples.

  • Regular: connect minority class points to any neighbor.
  • Borderline: classify minority points as outlier, safe, or in-danger.
      1. Borderline-1: connect minority in-danger points only to minority points.
      2. Borderline-2: connect minority in-danger points to whatever is nearby.
  • SVM: use minority support vectors to generate new points. (All three variants are sketched below.)
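
A sketch of the variants via imbalanced-learn, again assuming `X`, `y` are the training data:

```python
# Sketch: SMOTE and its Borderline/SVM variants.
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
X_b1, y_b1 = BorderlineSMOTE(kind='borderline-1', random_state=0).fit_resample(X, y)
X_b2, y_b2 = BorderlineSMOTE(kind='borderline-2', random_state=0).fit_resample(X, y)
X_sv, y_sv = SVMSMOTE(random_state=0).fit_resample(X, y)
```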

ADASYN

Adaptive Synthetic Sampling of the minority class (ADASYN) generates new samples for each minority point in proportion to the number of competing (majority-class) points around it.
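
A one-line sketch with imbalanced-learn, same assumptions on `X`, `y`:

```python
# Sketch: ADASYN puts more synthetic points where minority samples
# are surrounded by majority neighbors.
from imblearn.over_sampling import ADASYN

X_ada, y_ada = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
```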

Undersampling -> Study again!!

NearMiss

Choose the points from the majority class that are nearest to the minority points. The purpose of this is to maintain the decision boundary.

  • NearMiss-1: keep the majority points with the smallest average distance to their N closest minority points.
  • NearMiss-2: keep the majority points with the smallest average distance to their N farthest minority points.
  • NearMiss-3: two steps. 1) For each minority sample, keep its M nearest majority neighbors; 2) among those, take the majority points whose average distance to the nearest minority points is largest. (See the sketch below.)
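
A sketch of the three variants with imbalanced-learn:

```python
# Sketch: NearMiss undersampling, versions 1-3.
from imblearn.under_sampling import NearMiss

X_nm1, y_nm1 = NearMiss(version=1).fit_resample(X, y)
X_nm2, y_nm2 = NearMiss(version=2).fit_resample(X, y)
X_nm3, y_nm3 = NearMiss(version=3).fit_resample(X, y)
```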

Edited Nearest Neighbors

This removes points whose class does not agree with the majority of their nearest neighbors.
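
A sketch with imbalanced-learn (note the British spelling of the class name):

```python
# Sketch: remove samples whose class disagrees with their neighbors.
from imblearn.under_sampling import EditedNearestNeighbours

X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
```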

Blagging

Blagging is balanced bagging. While bootstrapping, it balances each bootstrap sample by downsampling the majority class, and the final prediction is made by majority vote of the base classifiers.
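
A sketch with imbalanced-learn's `BalancedBaggingClassifier`, which by default rebalances each bootstrap sample by random undersampling; `X_train`, `y_train`, `X_test` are assumed to exist.

```python
# Sketch: balanced bagging ("blagging"); each bootstrap sample is
# rebalanced by undersampling, final prediction is a majority vote.
from imblearn.ensemble import BalancedBaggingClassifier

blag = BalancedBaggingClassifier(n_estimators=50, random_state=0)
blag.fit(X_train, y_train)
y_pred = blag.predict(X_test)
```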
