What is SMOTE preprocessing?

Synthetic Minority Oversampling Technique (SMOTE) is a preprocessing technique used to improve the performance of propensity models trained on imbalanced datasets, i.e., datasets in which the outcome you are trying to predict is infrequent.

Say you are trying to predict an event that occurs in only 3% of cases. The training dataset is likely to contain very few instances of that outcome, e.g., 30 for every 1,000 records. This scarcity usually results in poor model performance, and the effect is noticeable in most situations where the outcome to be predicted occurs in 10% of cases or less.

SMOTE generates additional (so-called "synthetic") data points by randomly interpolating between existing positive outcomes, typically between a positive record and one of its nearest positive neighbors. This produces a new training dataset with the same negative outcomes and many more positive outcomes, up to the point where the training dataset is balanced, i.e., there are as many positive outcomes as negative outcomes.
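
The interpolation step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the implementation used by any particular product or library. The function name `smote_sketch`, the choice of `k=5` neighbors, and the toy two-feature data are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Hypothetical helper: create n_synthetic points by interpolating
    between minority-class points and their nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the line segment between base and neighbor
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy data (assumed): 970 negatives and 30 positives, two features each,
# matching the 3% example above
data_rng = random.Random(42)
negatives = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(970)]
positives = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(30)]

# Generate enough synthetic positives to match the number of negatives
new_positives = smote_sketch(positives, n_synthetic=len(negatives) - len(positives))
balanced_positives = positives + new_positives
print(len(negatives), len(balanced_positives))  # 970 970
```

In this sketch, 940 synthetic positives are added to the 30 real ones. Every synthetic point lies on a line segment between two real positive records, so no synthetic point falls outside the range of the observed positive data.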

Once the model is trained on the re-balanced dataset, its performance is evaluated on a normal, imbalanced test dataset that contains no synthetic records. Performance on the rare outcome is usually markedly improved.
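
The order of operations matters here: the test records are set aside first, and only the training portion is oversampled. The sketch below illustrates that workflow under stated assumptions. The single-feature toy records, the 20% test fraction, and the `split` helper are hypothetical. To keep the example short, it interpolates between two random training positives rather than between nearest neighbors as SMOTE does.

```python
import random

rng = random.Random(0)

# Hypothetical labeled records (feature, label) with 3% positives
records = [(rng.gauss(3, 1), 1) for _ in range(30)]
records += [(rng.gauss(0, 1), 0) for _ in range(970)]


def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and return (train, test)."""
    rows = rows[:]
    rng.shuffle(rows)
    cut = round(len(rows) * test_fraction)
    return rows[cut:], rows[:cut]


# 1. Hold out 20% of each class BEFORE any oversampling
pos = [r for r in records if r[1] == 1]
neg = [r for r in records if r[1] == 0]
train_pos, test_pos = split(pos, 0.2, rng)
train_neg, test_neg = split(neg, 0.2, rng)

# 2. Oversample positives in the training portion only.
#    Simplification: interpolate between two random training positives
#    (SMOTE proper picks among the k nearest neighbors instead).
synthetic = []
while len(train_pos) + len(synthetic) < len(train_neg):
    a, b = rng.sample(train_pos, 2)
    gap = rng.random()
    synthetic.append((a[0] + gap * (b[0] - a[0]), 1))

train = train_pos + synthetic + train_neg  # balanced
test = test_pos + test_neg                 # untouched, still imbalanced

train_rate = sum(label for _, label in train) / len(train)
test_rate = sum(label for _, label in test) / len(test)
print(f"train positive rate: {train_rate:.2f}")  # 0.50
print(f"test positive rate:  {test_rate:.2f}")   # 0.03
```

Keeping synthetic records out of the test set means the evaluation reflects the event rate the model will encounter in practice.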
