I was running my AI scoring algorithm, which takes a bunch of calculated metrics and ratios as inputs (the features, or X variables) and feeds them into a Decision Tree algorithm (Random Forest) against a price prediction (the Y variable). It then prints out a report showing how well the algorithm performed overall (R-squared), along with a list of features sorted by their influence on the Y variable (price).
There are numerous algorithms that can do this, the simplest being a Linear Regression model. Decision Trees offer a faster, more efficient, and perhaps more accurate alternative to linear regression, provided the tree is pruned and managed correctly and doesn't become lopsided or imbalanced.
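As a rough sketch of the kind of pipeline I'm describing, here is what it might look like with scikit-learn. The dataframe `df` and the `price` column are hypothetical placeholders, not my actual feature set:

```python
# Minimal sketch of the feature -> Random Forest -> report pipeline.
# Assumes a dataframe `df` with metric/ratio columns plus a `price` column;
# these names are placeholders, not the actual feature set.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

feature_cols = [c for c in df.columns if c != "price"]
X, y = df[feature_cols], df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# Overall performance on held-out data (R-squared)
print("R-squared:", r2_score(y_test, model.predict(X_test)))

# Features sorted by their influence on the Y variable (price)
importances = pd.Series(model.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```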
But I ran into problems, especially when checking the results and data carefully. Most of the issues were related to the data itself.
Data Alignment
I noticed that the predictive z-scores for my features didn't "line up" when I printed them twice. It turned out to be a data alignment issue. When you are working with dataframes, making copies of them, and merging them, you need to be very careful or a column can get shifted.
This alignment issue was affecting my model because a value that belonged to a profitability metric was being assigned to a solvency metric. Now that I have this fixed, things look much more sensible. Making sure your dataframes stay aligned is a hard-learned lesson.
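Here is a minimal sketch of the kind of defensive merging that helps with this. The ticker/date keys and metric columns below are made up for illustration; the point is to merge on explicit keys rather than relying on row order, and to assert that the join behaved as expected:

```python
import pandas as pd

# Hypothetical example: profitability and solvency metrics in separate frames.
profitability = pd.DataFrame({
    "ticker": ["PARA", "DIS"],
    "date": ["2023-12-31", "2023-12-31"],
    "netProfitMargin": [0.02, 0.05],
})
solvency = pd.DataFrame({
    "ticker": ["DIS", "PARA"],          # note: different row order
    "date": ["2023-12-31", "2023-12-31"],
    "debtToEquity": [0.8, 1.4],
})

# Merging on explicit keys keeps each value attached to the right company,
# even when row order differs between frames. validate= catches accidental
# duplicate keys that would silently multiply rows.
merged = profitability.merge(
    solvency, on=["ticker", "date"], how="inner", validate="one_to_one"
)

# Quick sanity check that no rows were dropped or duplicated.
assert len(merged) == len(profitability)
print(merged)
```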
Outliers
Another issue I ran into today: when I printed out a weighted scoring report, certain values were far and away better than others. I didn't understand why, so I discussed it with the AI I am using as a consultant, which suggested I print out z-scores.
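A z-score is just how many standard deviations a value sits from the mean of its peer group. A rough sketch of how they can be computed per metric across a universe of companies (the dataframe, tickers, and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical metrics frame: one row per company, one column per ratio.
metrics = pd.DataFrame({
    "ticker": ["PARA", "DIS", "NFLX", "WBD"],
    "evToEBITDA": [95.0, 14.2, 33.1, 9.8],
    "netProfitMargin": [0.02, 0.05, 0.16, -0.08],
}).set_index("ticker")

# Cross-sectional z-score: distance from the peer-group mean for each
# metric, measured in standard deviations.
z_scores = (metrics - metrics.mean()) / metrics.std(ddof=0)
print(z_scores.round(3))
```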
Well, if we look below, we have an evToEBITDA z-score of 10.392 (an insane value) on Paramount's 2023 reporting data.
=== Z-scores for PARA on 2023-12-31 ===
Z-scores for PARA on 2023-12-31 in pillar 'Profitability':
grossProfitMargin: -0.263
operatingProfitMargin: 0.038
netProfitMargin: 0.029
returnOnAssets: -0.033
returnOnEquity: -0.006
returnOnCapitalEmployed: -0.089
returnOnTangibleAssets: 0.004
earningsYield: 0.008
freeCashFlowYield: 0.000
nopat_to_totalAssets: -0.170
operatingReturnOnAssets: -0.215
returnOnInvestedCapital: -0.031
ebitda_to_totalAssets: -0.384
operatingCashFlowToSales: 0.036
evToSales: -0.044
evToOperatingCashFlow: 0.054
evToEBITDA: 10.392
evToFreeCashFlow: 0.039
I audited the metrics and statements, and indeed this is correct - based on what Yahoo was returning to me on the income statement for that year (Normalized EBITDA). The unnormalized EBITDA was better, but in most cases, analysts use the Normalized value. You can't do one-offs in your code for things like this, so what do you do?
I couldn't drop the row, because I was already dropping so many 2020 rows of bad data (due to Covid, I suspect). I drop rows that are missing more than 35% of their metrics. When you get a row that has all of the values you need, you tend to want to use it. I don't have code that drops rows missing specific dealbreaker metrics; maybe I should, but there are so many metrics that I generally figure I can score and rank even if I am missing one here or there, even a fairly well-known or important one.
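For what it's worth, that missing-data filter is the kind of thing pandas handles in one line. This is just an illustrative sketch with made-up columns, not my actual filtering code:

```python
import numpy as np
import pandas as pd

# Hypothetical metrics frame; NaN marks a missing value.
df = pd.DataFrame({
    "grossProfitMargin": [0.40, np.nan, 0.30],
    "evToEBITDA": [12.0, np.nan, 9.5],
    "returnOnAssets": [0.05, 0.01, np.nan],
    "debtToEquity": [1.1, np.nan, 0.7],
})

# Keep only rows where at least ~65% of the metric columns are populated,
# i.e. drop rows missing more than 35% of the metrics.
min_non_null = int(np.ceil(0.65 * df.shape[1]))
filtered = df.dropna(thresh=min_non_null)
print(filtered)
```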
So - what do you do?
Winsorization. In other words, capping. It might make sense to invest the effort to Winsorize all of the metrics and ratios, but for now I am only doing it on the EBITDA ones.
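A minimal sketch of what that capping can look like, clipping at percentile bounds. The 1st/99th percentile limits and the sample values are my illustrative choices here, not necessarily what the actual scoring code uses:

```python
import numpy as np
import pandas as pd

def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Cap a series at its lower/upper quantiles (winsorization by clipping)."""
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s.clip(lower=lo, upper=hi)

# Hypothetical example: a distribution of evToEBITDA values with one extreme
# outlier. After winsorizing, the outlier is pulled back to the upper-quantile
# cap instead of blowing up the z-scores.
rng = np.random.default_rng(0)
ev_to_ebitda = pd.Series(rng.normal(loc=12, scale=3, size=200))
ev_to_ebitda.iloc[0] = 950.0  # the PARA-style outlier

capped = winsorize(ev_to_ebitda)
print("before:", ev_to_ebitda.max(), "after:", capped.max())
```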