
Wednesday, June 11, 2025

AI/ML - Feature Engineering

Originally, when I first started this project to learn AI, I set it up thus:

Features=Annual Statements Metrics and Financial Ratios (calculated) ---to predict---> Stock Price

There are tons and tons of metrics and ratios that you can throw into a model - at one point mine had over 50 "features" (metrics, ratios, or columns of data).

Quickly, you get into Feature Engineering. 

You see, certain metrics are "circular" and co-dependent. You cannot use price-derived metrics to predict price. So these metrics need to be excluded if they are calculated and present in your dataset.
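As a sketch of that exclusion step - the column names here are hypothetical examples, not the actual dataset:

```python
import pandas as pd

# Hypothetical feature table; the price-derived columns are examples.
df = pd.DataFrame({
    "revenue_growth": [0.10, 0.05, 0.20],
    "debt_to_equity": [1.2, 0.8, 0.5],
    "pe_ratio":       [15.0, 22.0, 30.0],  # price-derived: uses price
    "price_to_book":  [2.1, 3.4, 1.8],     # price-derived: uses price
})

# Exclude any feature calculated from the target (price); otherwise
# the model "predicts" price from price.
PRICE_DERIVED = ["pe_ratio", "price_to_book", "price_to_sales", "market_cap"]
features = df.drop(columns=[c for c in PRICE_DERIVED if c in df.columns])
print(list(features.columns))  # ['revenue_growth', 'debt_to_equity']
```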

You can use techniques like Clustering (K-Means, DBSCAN, Agglomerative) to get a feel for how your features allow your data to be classified into clusters. An interesting exercise that I went through, but ultimately moved away from in pursuit of trying to pick winning stocks.
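A minimal K-Means sketch with scikit-learn, using synthetic data in place of real financial ratios:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for a (symbols x features) matrix of financial ratios:
# two well-separated groups of 50 "symbols" each.
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])

# Scale first: K-Means is distance-based, so unscaled ratios
# with large magnitudes would dominate the clustering.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # two clusters of ~50 symbols each
```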

You can use some nice tools for picking through a huge amount of data and finding "holes" (empty values, etc.) that can adversely affect your model.

From a column (feature) perspective, you can:

  • Impute the holes (using the mean, median, or some other mechanism).
  • Drop the column entirely.

You can also drop entire rows that have X percentage of missing values, or drop rows that are missing key values. Figuring all of this out takes time. It is part of the Data Engineering.
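The column and row options above can be sketched with pandas (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "eps":     [1.2, None, 0.8, None],
    "revenue": [10.0, 12.0, None, None],
    "margin":  [0.2, 0.25, 0.22, None],
})

# Option 1: impute a column's holes with its median.
df["eps"] = df["eps"].fillna(df["eps"].median())

# Option 2: drop rows missing more than 50% of their values
# (dropna's thresh = minimum number of non-null values to keep a row).
df = df.dropna(thresh=int(df.shape[1] * 0.5) + 1)

# Option 3: drop rows that are missing a key value outright.
df = df.dropna(subset=["revenue"])
print(len(df))  # 2 rows survive
```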

Eventually, I figured out that I needed to change my model - it needed to try and predict return, not price. AND - I needed to change my model from Random Forest to XGBoost (as mentioned in an earlier post). 

So now, we will be doing this...

Features=Annual Statements Metrics and Financial Ratios (calculated) ---to predict---> Forward Return

Well, guess what? If you calculate a forward return, you lose at least one year's worth of rows, because the return needs a future price that is not in the data yet. Given that we typically throw away 2020 because of missing values (Covid, I presume), this means you now lose 2020 and 2021 - leaving you with just 2022, 2023, and 2024. Yes, you have thousands of symbols, but you cannot afford to train and test a model when you are losing that much data. But that is the way it has to be... most financial models are seeking a return, not a price. Enlightening. Makes sense.

I also realized that in order to "smooth out the noise," it made sense to use multiple periods when calculating the return. This causes MORE data to be lost. So it becomes a balancing act: maximizing your R-squared value against the loss of data.
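A sketch of the forward-return calculation with pandas, assuming hypothetical column names; with a shift-based convention, an N-period look-ahead drops N rows per symbol:

```python
import pandas as pd

# Hypothetical per-symbol annual prices (column names are assumptions).
df = pd.DataFrame({
    "symbol": ["AAA"] * 5,
    "year":   [2020, 2021, 2022, 2023, 2024],
    "price":  [100.0, 110.0, 121.0, 133.1, 146.41],
})

# Forward return over the next N periods: shift(-N) looks ahead,
# so N rows per symbol become NaN and are lost to training.
N = 2
df["fwd_return"] = df.groupby("symbol")["price"].shift(-N) / df["price"] - 1
print(df["fwd_return"].isna().sum())  # N rows lost per symbol
```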

I added some additional metrics (features): 

  • QoQ (quarter-over-quarter) growth (EPS, revenue, free cash flow)
  • momentum  
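A sketch of how such features might be derived with pandas `pct_change` (the column names are assumptions):

```python
import pandas as pd

# Hypothetical quarterly fundamentals per symbol.
q = pd.DataFrame({
    "symbol":  ["AAA"] * 4,
    "quarter": ["2024Q1", "2024Q2", "2024Q3", "2024Q4"],
    "eps":     [1.00, 1.10, 1.21, 1.331],
})

g = q.groupby("symbol")["eps"]
# QoQ growth: percent change versus the prior quarter.
q["eps_qoq"] = g.pct_change()
# Momentum: change over a longer lookback (here, 2 quarters).
q["eps_mom_2q"] = g.pct_change(periods=2)
print(q[["eps_qoq", "eps_mom_2q"]].round(3).values.tolist())
```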

So far, these are showing up among the top features that influence the forward return.

But the R-squared for the Quarterly model is .09 - which is extremely low and not very valuable. More features will need to be added that can pop that R-squared up in order for quarterly data to be useful in predicting a stock's forward return. 

Tuesday, June 3, 2025

AI / ML - Bagging vs Boosting - A Bake-Off Between Random Forest and XGBoost

I had not heard of XGBoost until I encountered someone else who was also doing some Fintech AI as a side project to learn Artificial Intelligence. Having heard about it, I was brought up to speed on the fact that there are two families of ensemble models: Bagging and Boosting. Random Forest is a prevailing Bagging algorithm, while XGBoost is a Boosting algorithm.

These models work well with structured (tabular) data like financial data.  XGBoost was supposed to be the "better" algorithm, so I decided to do a quick test before I came into the office this morning.

I cloned the code, took the function that runs Random Forest, and created a function that runs XGBoost. Then I ran them both: Random Forest first, then XGBoost. The R-squared value was .5588 for Random Forest and .5502 for XGBoost.

So on this test, Random Forest won - but not by a huge margin. 
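A self-contained version of that bake-off on synthetic data. Since the xgboost package may not be installed everywhere, scikit-learn's GradientBoostingRegressor stands in for the boosting side here; swap in `xgboost.XGBRegressor` for the real comparison:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic tabular data standing in for the financial feature matrix.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: many deep trees trained in parallel on bootstrap samples.
bagging = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# Boosting: shallow trees trained sequentially, each fixing prior errors.
boosting = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

r2_bag = r2_score(y_te, bagging.predict(X_te))
r2_boost = r2_score(y_te, boosting.predict(X_te))
print("Bagging R2:", round(r2_bag, 4), "Boosting R2:", round(r2_boost, 4))
```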

Both of these models can be tuned. To tune either of them, one relies on what is known as a Grid Search, which tries out different combinations of parameter values and reports back the best.


So...I will tune the hyperparameters of both of these and re-test. 
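A grid-search sketch for the Random Forest side, on synthetic data and with a deliberately small grid (a real search would cover more values, as the results below show):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Each combination is cross-validated; best_params_ is the winner.
param_grid = {
    "n_estimators": [50, 150],
    "max_depth": [None, 7],
    "max_features": ["sqrt", 1.0],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 4))
```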

Followup:

After tuning Random Forest and re-running, this is what we got.

Best Random Forest parameters found: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
Best RF CV R2 score: 0.535807284324275
Tuned Random Forest R2: 0.579687260084845

This is a noticeable, if not substantial, improvement from .5588!

So let's tune XGBoost in a similar way, and re-run...


After tuning XGBoost and re-running, this is what we got - a substantial improvement.

Best parameters found: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 250, 'subsample': 0.8}
Best CV R2 score: 0.5891076022240813
Tuned XGBoost R2: 0.6212745179228656

Conclusion: In a head-to-head test with no tuning, Random Forest beat XGBoost. But in a head-to-head test with proper tuning, XGBoost was the clear winner, with a .04 advantage.

.04, by the way, is roughly a 7% relative improvement in predictive power.

To rehash our Statistics understanding, R-squared is the coefficient of determination. It is a statistical metric used to evaluate how well a regression model explains the variability of the target variable.

A 1.0 R-squared means the model predicts perfectly. A value of 0.0 would mean that the model does no better than just predicting the mean of all values.
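Both endpoints can be verified directly with scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# A model that predicts perfectly scores 1.0 ...
assert r2_score(y_true, y_true) == 1.0

# ... and one that always predicts the mean scores 0.0.
mean_pred = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_pred))  # 0.0
```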

