Friday, May 23, 2025

AI / ML - Random Forest, Data Wrestling and Z-Scores

I was running my AI scoring algorithm, which takes as inputs a bunch of calculated metrics and ratios (features - X axis), and feeds those into a Decision Tree algorithm (Random Forest), against a price prediction (Y axis), which then prints out a report that shows how well the algorithm performed in general (R-squared), and a list of features sorted by their influence on the Y variable (price). 

There are numerous algorithms that can do this - the simplest being a Linear Regression model.  Decision Trees offer a faster and more efficient - and perhaps more accurate - alternative to linear regression, provided that the tree is pruned and managed correctly and that the tree doesn't get lopsided or imbalanced.

But I ran into problems, especially when checking the results and data carefully. And most of the issues, were related to the data itself.

Data Alignment
I noticed that the predictive z-scores for my features didn't "line up" when I printed them twice. Turns out, this was a data alignment issue. When you are using dataframes, and making copies of these dataframes and merging them, you need to be very very careful or a column can get shifted.

This alignment issue was affecting my model because the metric that WAS a profitability metric, was now being assigned to a solvency metric. Now that I have this fixed, things look much more sensible. But making sure your dataframes are aligned, is a hard-learned lesson.

Outliers
Other issues I ran into today had to do with the fact that when I printed a report out (a weighted scoring report), certain values were far and away better than others. I didn't understand this, and discussed it with the AI I am using as a consultant, who suggested I print out z-scores.

Well, if we look below, we have an evToEBITDA metric of 10.392 (insane value) on 2023 Paramount reporting data.

=== Z-scores for PARA on 2023-12-31 ===
Z-scores for PARA on 2023-12-31 in pillar 'Profitability':
  grossProfitMargin: -0.263
  operatingProfitMargin: 0.038
  netProfitMargin: 0.029
  returnOnAssets: -0.033
  returnOnEquity: -0.006
  returnOnCapitalEmployed: -0.089
  returnOnTangibleAssets: 0.004
  earningsYield: 0.008
  freeCashFlowYield: 0.000
  nopat_to_totalAssets: -0.170
  operatingReturnOnAssets: -0.215
  returnOnInvestedCapital: -0.031
  ebitda_to_totalAssets: -0.384
  operatingCashFlowToSales: 0.036
  evToSales: -0.044
  evToOperatingCashFlow: 0.054
  evToEBITDA: 10.392
  evToFreeCashFlow: 0.039
 
I audited the metrics and statements, and indeed this is correct - based on what Yahoo was returning to me on the income statement for that year (Normalized EBITDA). The unnormalized EBITDA was better, but in most cases, analysts use the Normalized value. You can't do one-offs in your code for things like this, so what do you do?

I couldn't drop the row, because I was already dropping so many 2020 rows of bad data (due to Covid I suspect). I drop rows that are missing >35% of metrics. When you get a row that has all of the values you need, you tend to want to use it. I don't have code that drops rows that don't have specific dealbreaker metrics - maybe I should, but there are so many metrics that generally I figure I can score and rank even if I am missing one here or there, even a fairly well-known or important one. 

So - what do you do?

Winsorization. In other words, capping. It might make sense to invest the effort in Winsorizing all of the metrics and ratios. But for now, I am only doing it on these EBITDA ones.

No comments:

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts earlier on, and somehow not until just now - this late in the game - did they come up as I was re...