I had not heard of XGBoost until I met someone else who was also doing some fintech AI as a side project to learn artificial intelligence. Having heard about it, I was brought up to speed on the fact that there are two main families of ensemble methods: bagging and boosting. Random Forest is a prevailing bagging algorithm, while XGBoost is a boosting algorithm.
These models work well with structured (tabular) data like financial data. XGBoost was supposed to be the "better" algorithm, so I decided to do a quick test before I came into the office this morning.
I cloned the code, took the function that runs Random Forest, and wrote a matching function that runs XGBoost. Then I ran them both, Random Forest first, then XGBoost. The R-squared value was .5588 for Random Forest and .5502 for XGBoost.
So on this test, Random Forest won - but not by a huge margin.
Both of these models can be tuned. To tune either of them, one typically relies on what is known as a grid search, which tries out combinations of hyperparameter values, scores each combination with cross-validation, and reports back the best one.
So...I will tune the hyperparameters of both of these and re-test.
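A minimal sketch of that tuning step using scikit-learn's `GridSearchCV`, shown here for Random Forest. The grid below is illustrative, not the exact grid used in the runs reported next, and the data is again a synthetic placeholder.

```python
# Grid search over Random Forest hyperparameters with cross-validated R2.
# The parameter grid is an illustrative assumption, not the original one.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

param_grid = {
    "n_estimators": [100, 250],
    "max_depth": [None, 5, 7],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="r2",   # cross-validated R-squared, matching the metric above
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print("Best parameters found:", search.best_params_)
print("Best CV R2 score:", search.best_score_)
```

Swapping in `XGBRegressor` with a grid over `learning_rate`, `max_depth`, `n_estimators`, and `subsample` tunes XGBoost the same way.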
Followup:
After tuning Random Forest and re-running, this is what we got.
Best Random Forest parameters found: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
Best RF CV R2 score: 0.535807284324275
Tuned Random Forest R2: 0.579687260084845
This is a noticeable, if not substantial, improvement over the untuned .5588!
So let's tune XGBoost in a similar way, and re-run...
After tuning XGBoost and re-running, this is what we got. A substantial improvement.
Best parameters found: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 250, 'subsample': 0.8}
Best CV R2 score: 0.5891076022240813
Tuned XGBoost R2: 0.6212745179228656
Conclusion: In a head-to-head test with no tuning, Random Forest beat XGBoost, but only narrowly. In a head-to-head test with proper tuning, XGBoost was the clear winner, with about a .04 advantage (0.621 vs. 0.580).
That .04, by the way, is roughly a 7% improvement in the share of variance explained.
To rehash our statistics understanding, R-squared is the coefficient of determination. It is a statistical metric used to evaluate how well a regression model explains the variability of the target variable.
An R-squared of 1.0 means the model predicts perfectly. A value of 0.0 means the model does no better than always predicting the mean of the target values.
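That definition is short enough to compute by hand: R-squared is one minus the ratio of the model's squared error to the squared error of a mean-only predictor.

```python
# R-squared from its definition: 1 - SS_res / SS_tot.
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean-predictor's squared error
    return 1.0 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0]                      # mean is 6.0
print(r_squared(y, y))                        # perfect predictions -> 1.0
print(r_squared(y, [6.0, 6.0, 6.0, 6.0]))     # predicting the mean -> 0.0
```

This matches what scikit-learn's `r2_score` computes, and makes clear why a model can even score below zero: it just has to be worse than predicting the mean.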