Tuesday, June 17, 2025

AI / ML - Feature Engineering - Normalization

On my quarterly financials model, the R² is awful. I have decided that I need more data to improve the model's score; 3-4 quarters is probably not enough. That means I need to build something that pulls the data from the SEC and parses it.

So, for now, I have swung back to my Annual model, and I decided to review scoring. 

One thing I noticed - and had forgotten about - was that I had a Normalization routine, which took certain metrics and tried to scale-flatten them for better ranking and comparison. The routine divides those metrics by Total Assets. I am sure this was a recommendation from one of the AI bots I was consulting while building my scoring (which is complex, to say the least).

Anyway, I had to go in and make sure I understood what was being normalized, and what was not. The Normalization logic uses keywords to skip metrics that should NOT be normalized. The metrics that ARE normalized are divided by TotalAssets, and each metric's name is changed dynamically to reflect this. This logic was doing its job reasonably well, but after I added a plethora of new metrics, some of them were being normalized when they should not have been.

So this is the new logic. 
    # --- Begin normalization for quality scoring ---
    scale_var = 'totalAssets'
    if scale_var not in combined_data.columns:
        raise ValueError(f"{scale_var} not found in data columns!")

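    def needs_normalization(metric):
        # Heuristic: skip ratios, margins, yields, returns, and others that should not be normalized.
        # Matching is a case-insensitive substring test, so short keywords such as 'To' and 'Per'
        # also match inside longer names (e.g. 'inventory', 'operatingIncome').
        skipthese = ['Margin', 'Ratio', 'Yield', 'Turnover', 'Return', 'Burden', 'Coverage', 'To', 'Per',
                     'daysOf', 'grw_yoy', 'nopat', 'dilutedEPS', 'ebitda', 'freeCashFlowPerShare',
                     'sentiment', 'confidence', 'momentum', 'incomeQuality']
        # Normalize only when none of the skip keywords appears in the metric name.
        return all(k.lower() not in metric.lower() for k in skipthese)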
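
The snippet above only decides which metrics qualify. As a rough sketch of how the divide-and-rename step could be wired around it, here is a self-contained version. The helper name `normalize_by_total_assets`, the `_norm` suffix, the zero-asset handling, and the demo columns are all assumptions for illustration, not the actual routine from my scoring code.

```python
import numpy as np
import pandas as pd

SKIP_KEYWORDS = ['Margin', 'Ratio', 'Yield', 'Turnover', 'Return', 'Burden', 'Coverage', 'To', 'Per',
                 'daysOf', 'grw_yoy', 'nopat', 'dilutedEPS', 'ebitda', 'freeCashFlowPerShare',
                 'sentiment', 'confidence', 'momentum', 'incomeQuality']


def needs_normalization(metric):
    # Same case-insensitive substring heuristic as in the post.
    return all(k.lower() not in metric.lower() for k in SKIP_KEYWORDS)


def normalize_by_total_assets(df, scale_var='totalAssets', suffix='_norm'):
    """Divide qualifying numeric columns by scale_var and rename them (hypothetical helper)."""
    if scale_var not in df.columns:
        raise ValueError(f"{scale_var} not found in data columns!")

    out = df.copy()
    # Assumption: treat zero total assets as missing so the division yields NaN, not inf.
    denom = out[scale_var].replace(0, np.nan)

    renamed = {}
    for col in df.columns:
        if col == scale_var or not pd.api.types.is_numeric_dtype(df[col]):
            continue  # never scale the scaling column or non-numeric columns
        if needs_normalization(col):
            out[col] = out[col] / denom
            renamed[col] = f"{col}{suffix}"  # assumed naming scheme

    return out.rename(columns=renamed), renamed


# Made-up demo data, for illustration only.
demo = pd.DataFrame({
    'symbol': ['AAA', 'BBB'],
    'totalAssets': [1000.0, 2000.0],
    'netIncome': [100.0, 50.0],
    'grossMargin': [0.40, 0.25],
})
normalized, renamed = normalize_by_total_assets(demo)
print(normalized)
print(renamed)
```

Returning the rename map alongside the frame makes it easy to see afterwards exactly which columns were scaled.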

This took me some time to get working properly. When you have 70+ possible metrics in your basket, making sure each one is calculated correctly, and that the right ones are normalized while the others are NOT, takes time.
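
One way to make that checking less tedious is a small audit that reports, for each metric name, which skip keyword (if any) caused it to be skipped. This is only a sketch: `matched_keywords`, `audit`, and the sample metric names are hypothetical and not part of my actual code. It also shows why short keywords need watching, since the match is a case-insensitive substring test.

```python
SKIP_KEYWORDS = ['Margin', 'Ratio', 'Yield', 'Turnover', 'Return', 'Burden', 'Coverage', 'To', 'Per',
                 'daysOf', 'grw_yoy', 'nopat', 'dilutedEPS', 'ebitda', 'freeCashFlowPerShare',
                 'sentiment', 'confidence', 'momentum', 'incomeQuality']


def matched_keywords(metric):
    # Return every skip keyword found (case-insensitively) inside the metric name.
    name = metric.lower()
    return [k for k in SKIP_KEYWORDS if k.lower() in name]


def audit(metrics):
    # Map each metric to the keywords that would exclude it from normalization.
    return {m: matched_keywords(m) for m in metrics}


# Hypothetical metric names, chosen to show intended and accidental matches.
sample = ['netIncome', 'grossMargin', 'debtToEquity', 'inventory', 'operatingIncome']

for metric, hits in audit(sample).items():
    status = f"skipped (matched {hits})" if hits else "normalized"
    print(f"{metric:18s} {status}")
```

In this sample, `inventory` is skipped because it contains "to", and `operatingIncome` is skipped because it contains "per". If names like these should be normalized, the short keywords would need a stricter match, for example a case-sensitive one.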

 
