When I first started this project to learn AI, I set it up like this:
Features=Annual Statements Metrics and Financial Ratios (calculated) ---to predict---> Stock Price
There are tons and tons of metrics and ratios that you can throw into a model - at one point mine had over 50 "features" (metrics, ratios, or columns of data).
Quickly, you get into Feature Engineering.
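You see, certain metrics are "circular" and co-dependent. You cannot use price-derived metrics (a P/E ratio, for example, has the price in its numerator) to try to predict price, because the answer leaks into the inputs. So these metrics need to be excluded if they are calculated and present in your dataset.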
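Here is a minimal sketch of that exclusion step, assuming the features live in a pandas DataFrame. The column names (pe_ratio, price_to_book, market_cap, and so on) are made up for illustration; substitute whatever price-derived columns your own dataset actually contains.

```python
import pandas as pd

# Hypothetical names for price-derived columns; adjust to your own dataset.
PRICE_DERIVED = ["pe_ratio", "price_to_book", "market_cap", "dividend_yield"]


def drop_price_derived(df, leaky=PRICE_DERIVED):
    """Return a copy of df without the columns that are computed from price."""
    present = [col for col in leaky if col in df.columns]
    return df.drop(columns=present)


# Tiny made-up sample, for illustration only.
features = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "revenue": [100.0, 250.0],
    "net_margin": [0.12, 0.08],
    "pe_ratio": [15.2, 22.7],
    "market_cap": [1.5e9, 4.0e9],
})

clean = drop_price_derived(features)
print(list(clean.columns))  # ['symbol', 'revenue', 'net_margin']
```

Keeping the leaky names in one explicit list makes it easy to review what was excluded and why.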
You can use techniques like clustering (K-Means, DBSCAN, Agglomerative) to get a feel for how your features allow your data to be grouped into clusters. It was an interesting exercise to go through, but in the end I moved away from it in pursuit of trying to pick winning stocks.
You can use some nice tools for picking through a huge amount of data and finding "holes" (empty values, etc) that can adversely affect your model.
From a column (feature) perspective, you can:
- Fill these holes by imputing them (using the mean, median, or some other mechanism).
- Or drop the columns that have them.
You can also drop entire rows that are missing more than X percent of their values, or drop rows that are missing key values. Figuring all of this out takes time. It is part of the data engineering.
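The choices above can be sketched in pandas roughly like this. The 50% thresholds, the column names, and the order of the steps are all assumptions made for the example, not the settings used in this project.

```python
import numpy as np
import pandas as pd


def clean_missing(df, key_cols, max_col_missing=0.5, max_row_missing=0.5):
    """Drop sparse columns and rows, then median-impute whatever is left."""
    # 1. Drop columns whose share of missing values exceeds the threshold.
    col_missing = df.isna().mean()
    df = df.loc[:, col_missing <= max_col_missing]
    # 2. Drop rows that are missing any of the key values.
    keys = [col for col in key_cols if col in df.columns]
    df = df.dropna(subset=keys)
    # 3. Drop rows whose share of missing values exceeds the threshold.
    df = df.loc[df.isna().mean(axis=1) <= max_row_missing].copy()
    # 4. Fill the remaining holes in numeric columns with the column median.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df


# Made-up data with holes, for illustration only.
raw = pd.DataFrame({
    "revenue":     [100.0, np.nan, 300.0, 400.0],
    "eps":         [1.0, 2.0, np.nan, 4.0],
    "mostly_gone": [np.nan, np.nan, np.nan, 7.0],
    "roe":         [0.10, 0.20, 0.30, np.nan],
})

cleaned = clean_missing(raw, key_cols=["revenue"])
print(raw.isna().mean())  # share of missing values per column, before cleaning
print(cleaned)
```

In this sample, "mostly_gone" is dropped as a column, the row without revenue is dropped, and the two remaining holes are filled with medians.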
Eventually, I figured out that I needed to change my target - the model needed to try to predict return, not price. AND - I needed to change my algorithm from Random Forest to XGBoost (as mentioned in an earlier post).
So now, we will be doing this...
Features=Annual Statements Metrics and Financial Ratios (calculated) ---to predict---> Forward Return
Well, guess what? If you calculate a forward return, you are going to lose at least one row of data per symbol. Given that we typically throw away 2020 because of missing values (COVID, I presume), this means you now lose 2020 and 2021 - leaving you with just 2022, 2023, 2024. Yes, you have thousands of symbols, but you can hardly afford to lose that much data when you are training and testing a model. But - that is the way it has to be...most financial models are seeking a return, not a price. Enlightening. Makes sense.
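To see where the row goes, here is a small sketch of a one-period forward return in pandas. The column names (symbol, year, price) and the prices are invented for the example. Which end of each symbol's history loses a row depends on how you align the return; in this sketch, which looks ahead with shift(-1), it is the most recent row that ends up without a label, because there is no later price to compare against.

```python
import pandas as pd


def add_forward_return(df, periods=1):
    """Add fwd_return: the price change from each row to `periods` rows later."""
    df = df.sort_values(["symbol", "year"]).copy()
    # Look ahead within each symbol so returns never cross company boundaries.
    future_price = df.groupby("symbol")["price"].shift(-periods)
    df["fwd_return"] = future_price / df["price"] - 1.0
    return df


# Made-up prices, for illustration only.
prices = pd.DataFrame({
    "symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "year":   [2021, 2022, 2023, 2024] * 2,
    "price":  [10.0, 12.0, 9.0, 18.0, 50.0, 55.0, 66.0, 33.0],
})

labeled = add_forward_return(prices)
usable = labeled.dropna(subset=["fwd_return"])
print(len(prices), "rows in,", len(usable), "rows usable")  # 8 rows in, 6 rows usable
```

Every symbol gives up one row, so with only a handful of annual periods per symbol the loss is a big fraction of the dataset.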
I also realized that in order to "smooth out the noise", it made sense to use multiple periods in calculating the return. This causes MORE data to be lost. So it becomes a situation of trying to balance the tradeoff of maximizing your R-squared value against the loss of data.
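A rough way to see that tradeoff is to count how many rows still have a label as the horizon grows. This sketch assumes "multiple periods" means measuring the return over a longer look-ahead window of h periods; the column names and data are invented for the example.

```python
import pandas as pd


def usable_rows_by_horizon(df, horizons):
    """Count the rows that still have a forward return for each horizon."""
    df = df.sort_values(["symbol", "period"])
    counts = {}
    for h in horizons:
        # Price h periods ahead, within the same symbol.
        future_price = df.groupby("symbol")["price"].shift(-h)
        fwd_return = future_price / df["price"] - 1.0
        counts[h] = int(fwd_return.notna().sum())
    return counts


# 3 made-up symbols with 8 periods each, for illustration only.
prices = pd.DataFrame({
    "symbol": [s for s in ["AAA", "BBB", "CCC"] for _ in range(8)],
    "period": list(range(8)) * 3,
    "price":  [10.0 + i for i in range(24)],
})

print(usable_rows_by_horizon(prices, [1, 2, 4]))  # {1: 21, 2: 18, 4: 12}
```

Each extra period in the horizon costs one more row per symbol, which is why a smoother target and a bigger training set pull in opposite directions.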
I added some additional metrics (features):
- QoQ growth (EPS, revenue, free cash flow)
- momentum
So far, these are showing up among the top features that influence the forward return.
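For anyone wanting to try something similar, here is one way these features could be calculated. It is a sketch under assumptions: the column names are invented, and "momentum" is taken here to mean the trailing price change over the previous four quarters, which may not match the exact definition used in this project.

```python
import pandas as pd


def add_growth_features(df, momentum_window=4):
    """Add quarter-over-quarter growth columns and a trailing momentum column."""
    df = df.sort_values(["symbol", "quarter"]).copy()
    grouped = df.groupby("symbol")
    for col in ["eps", "revenue", "free_cash_flow"]:
        # Change versus the previous quarter of the same symbol.
        # Note: growth off a negative base (e.g. negative EPS) is hard to interpret.
        df[f"{col}_qoq"] = df[col] / grouped[col].shift(1) - 1.0
    # Assumed definition: price change over the trailing `momentum_window` quarters.
    df["momentum"] = df["price"] / grouped["price"].shift(momentum_window) - 1.0
    return df


# Made-up quarterly data for one symbol, for illustration only.
quarters = pd.DataFrame({
    "symbol":         ["AAA"] * 5,
    "quarter":        ["2023Q1", "2023Q2", "2023Q3", "2023Q4", "2024Q1"],
    "eps":            [1.0, 2.0, 3.0, 4.0, 6.0],
    "revenue":        [100.0, 110.0, 121.0, 133.1, 146.41],
    "free_cash_flow": [10.0, 20.0, 10.0, 5.0, 10.0],
    "price":          [20.0, 22.0, 24.0, 26.0, 30.0],
})

out = add_growth_features(quarters)
print(out[["quarter", "eps_qoq", "revenue_qoq", "free_cash_flow_qoq", "momentum"]])
```

Because both calculations only look backward, they do not use anything from the period being predicted. They do cost rows at the start of each symbol's history, though: one for QoQ growth and four for this momentum window.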
But the R-squared for the quarterly model is 0.09 - which is extremely low and not very valuable. In other words, the model explains only about 9% of the variation in forward return. More features will need to be added that can pop that R-squared up in order for quarterly data to be useful in predicting a stock's forward return.
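For reference, R-squared compares the model's squared errors against those of a naive model that always predicts the average return. A plain-Python sketch of the standard formula, using invented return values:

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (sum of squared errors) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up forward returns, for illustration only.
actual = [0.10, -0.05, 0.20, 0.00, -0.10]
mean_return = sum(actual) / len(actual)

# Always predicting the average scores exactly 0; a perfect model scores 1.
print(r_squared(actual, [mean_return] * len(actual)))  # 0.0
print(r_squared(actual, actual))                       # 1.0
```

On that scale, 0.09 sits much closer to "always guess the average" than to a perfect fit.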