
Friday, August 1, 2025

AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my 1-year forward return prediction model, I kind of felt as though I had hit the end of the road on the model. The Earnings Surprise Score did move the needle. But after all of the Feature Engineering effort I had put into this model, the only thing I really felt I could still add was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant for only the latest row of data, I decided to do some final reviews and move on, or "graduate" to some new things - like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a Quarterly model, but the R-squared on it was terrible and I decided not to use it - not even to ensemble it with the annual report data model.

So - I asked a few different LLMs to code review my model. And I was horrified to learn that because of using LLMs to continually tweak my model, I had wound up with issues related to "Snippet Integration". 

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.

One issue was that the modeling code lived in the main() function, which had gotten way too long and hard to read. This one function contained the training, the testing, the splitting/pruning, the model fitting, AND all of the scoring. I broke the train/validate/test process out on its own, and put that process into play for both the full model and the SHAP-pruned model. Then I took the best R-squared from the two runs and used the winning model.
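
Here is a rough sketch of the shape that refactor took (a sketch only - the function name and the placeholders like df, all_features, and shap_pruned_features are illustrative, not my exact code). The important part is the order of operations: filter bad targets first, split once, fit preprocessing on train only, and compare candidates on the validation score before ever looking at test.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    def train_validate_test(df, features, target='fwdreturn', seed=42):
        # 1. Drop invalid targets (NaN, inf, <= -100%) BEFORE any splitting or fitting.
        invalid_mask = df[target].isna() | np.isinf(df[target]) | (df[target] <= -1.0)
        print(f"Dropping {invalid_mask.sum()} rows with invalid {target}")
        df = df.loc[~invalid_mask]

        # 2. One clean 60/20/20 train/val/test split, defined in exactly one place.
        train_df, temp_df = train_test_split(df, test_size=0.4, random_state=seed)
        val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=seed)

        # 3. Fit preprocessing (here just imputation) on the training set only.
        imputer = SimpleImputer(strategy='median').fit(train_df[features])
        X_train, X_val, X_test = [imputer.transform(d[features]) for d in (train_df, val_df, test_df)]

        # 4. Fit the model and score validation and test separately.
        model = RandomForestRegressor(n_estimators=500, random_state=seed, n_jobs=-1)
        model.fit(X_train, train_df[target])
        val_r2 = r2_score(val_df[target], model.predict(X_val))
        test_r2 = r2_score(test_df[target], model.predict(X_test))
        return model, val_r2, test_r2

    # Run once with the full feature list and once with the SHAP-pruned list,
    # pick the winner on the VALIDATION R-squared, and only then look at test.
    full_model, full_val, full_test = train_validate_test(df, all_features)
    pruned_model, pruned_val, pruned_test = train_validate_test(df, shap_pruned_features)
    winner = full_model if full_val >= pruned_val else pruned_model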

Thursday, June 26, 2025

AI / ML - Using a DAG Workflow for your Data Pipeline

I have been trying to find a lightweight mechanism to run my increasing number of scripts for my data pipeline.

I have looked at a lot of them. Most are heavy - requiring databases, message queues, and all that goes with a typical 3-4 tier application. Some run in Docker.

I tried using Windmill at first. I liked the GUI - very nice. The deal killer for me was that Windmill wants to soak in and make its own copies of anything it runs. It can't just reach out and run scripts that, for example, are sitting in a directory path. It apparently can't (could be wrong on this, but I think I am right) do a git clone to a directory and run the content from where it sits. It wants to pull everything into its own internal database - as a copy. It wants to be a development environment. Not for me. Not what I was looking for. And I only want my "stuff" to be in a single spot.

I then tried using Prefect. What a mess. You have to write Python to use Prefect. Right away, the SQLite Eval database was locking when I did anything asynchronous. Then, stack traces, issues with the CLI, etc. I think they're changing this code too much. Out of frustration I killed the server and moved on.

My latest is DAGU - out of GitHub - Open Source. Wow - simple, says what it does, does what it says. It does not have some of the more advanced features, but it has a nice crisp well-designed and responsive UI, and it runs my stuff in a better way than cron can do.

Here is a sample screenshot. I like it.

Tuesday, June 17, 2025

AI/ML - Feature Engineering - Normalization

On my quarterly financials model, the R² is awful. I have decided that I need more data to make this model score better. Three to four quarters is probably not enough. That means I need to build something to go to the SEC filings and parse them.

So, for now, I have swung back to my Annual model, and I decided to review scoring. 

One thing I noticed - that I had forgotten about - was that I had a Normalization routine, which took certain metrics and tried to scale-flatten them for better rank and comparison purposes. This routine takes certain metrics and divides them by Total Assets. I am sure this was a recommendation from one of the AI bot engines I was consulting with while doing my scoring (which is complex, to say the least).

Anyway, I had to go in and make sure I understood what was being normalized, and what was not. The logic that does the Normalization uses keywords to skip certain metrics that should NOT be normalized. For the ones that ARE normalized, the metric is divided by TotalAssets, and the metric's name is changed - dynamically - to reflect this. This logic was doing its job reasonably well, but since I added a plethora of new metrics, some of them were being normalized when they should not have been.

So this is the new logic. 
    # --- Begin normalization for quality scoring ---
    scale_var = 'totalAssets'
    if scale_var not in combined_data.columns:
        raise ValueError(f"{scale_var} not found in data columns!")

    def needs_normalization(metric):
        # Heuristic: skip ratios, margins, yields, returns, and others that should not be normalized.
        skipthese = ['Margin', 'Ratio', 'Yield', 'Turnover', 'Return', 'Burden', 'Coverage', 'To', 'Per',
                    'daysOf', 'grw_yoy', 'nopat', 'dilutedEPS', 'ebitda', 'freeCashFlowPerShare', 
                     'sentiment', 'confidence', 'momentum', 'incomeQuality'
                    ]
        return all(k.lower() not in metric.lower() for k in skipthese)
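
Applying that heuristic then looks roughly like this (a sketch - metric_cols is a stand-in for however you track your basket of metric columns):

    for metric in metric_cols:
        if metric == scale_var or not needs_normalization(metric):
            continue
        # Divide by total assets and rename the column so the scaled version is obvious downstream.
        combined_data[f"{metric}_to_{scale_var}"] = combined_data[metric] / combined_data[scale_var]
        combined_data.drop(columns=[metric], inplace=True)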

And this took me some time to get working properly, because when you have 70+ possible metrics in your basket, ensuring that each one is calculating correctly, that certain ones are normalized and certain ones are NOT normalized, and so on, takes time.

Tuesday, June 3, 2025

AI / ML - Fetching Data, Quality Control, Optimization and Review

Most of my time lately has been "refining" the model. 

For example, one of the things you need to really think about doing AI is where your data is coming from, and the quality of that data - and the price of that data. 

Originally, I was using FMP for data. But the unpaid version only gives you access to 100 symbols. You cannot get far with 100 symbols, even if you collect scores of metrics and ratios on them. So when you build your initial model on FMP, say using the TTM API on 100 symbols, you will need to consider anteing up for more symbols, or look for symbols and data elsewhere.

I have considered writing an intelligent bot to "scour" the universe of financial sites to pull in data on symbols. There are a lot of sites where you can get current data (e.g. stock price), but when it comes to statements, you are going to need to hit the SEC itself, or an intermediary or broker. If you hit these sites without reading their disclosures, you can get banned.

At a minimum, there is the rate limit. It is critical to understand how to rate limit your fetches. So using a scheduler, and running in batches (if they're supported) can really help you.  Another thing is intelligent caching. It makes no sense to get into hot water fetching the same statement for the same symbol you just fetched an hour ago. Once you have a complete statement, you probably want to keep it in your pocket, and only update on a lower frequency if you decide to update old data at all.
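
As a concrete illustration, the cache-then-fetch pattern can be as simple as a wrapper like the sketch below (fetch_fn, the cache path, and the one-second spacing are all placeholders - a real limiter should follow whatever limits your provider documents):

    import json
    import time
    from pathlib import Path

    CACHE_DIR = Path("cache/statements")       # hypothetical cache location
    MIN_SECONDS_BETWEEN_CALLS = 1.0            # spacing between live API calls
    _last_call = 0.0

    def fetch_statement(symbol, fetch_fn):
        """Return a cached statement if we already have it; otherwise fetch politely and cache it."""
        global _last_call
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cache_file = CACHE_DIR / f"{symbol}.json"
        if cache_file.exists():
            # Don't refetch a statement we already have - it rarely changes.
            return json.loads(cache_file.read_text())

        # Crude rate limit: sleep if the last live call was too recent.
        wait = MIN_SECONDS_BETWEEN_CALLS - (time.time() - _last_call)
        if wait > 0:
            time.sleep(wait)
        data = fetch_fn(symbol)                # fetch_fn wraps your provider's API call
        _last_call = time.time()

        cache_file.write_text(json.dumps(data))
        return data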

So most of my time lately has been spot checking the data, building some caching and trying to do some general improvement on the processing and flow.

I found a couple of nice Python tools one can use to load and view csv files: tabview and VisiData. The latter is a bit more robust. Having a csv viewer is a game changer if you want to stay in a terminal and not "point and click".

With a tool like this, you can really start to backtrack into missing holes of data. I had one metric, for example, whose name contained a typo (a single letter off), and it had NO data at all. I had other issues with division-by-zero errors, Pandas DataFrame vs Series issues, etc.

You also have to pay attention to these Python algorithms and what they spit out. The output may look like intimidating gibberish, but it's there for a reason, and taking the time to really examine it can pay off quite a bit. For example, I decided to exclude certain metrics because they had circular influence. And when you make a change like that, the feature influences can change drastically.

Friday, May 23, 2025

AI / ML - Random Forest, Data Wrestling and Z-Scores

I was running my AI scoring algorithm, which takes as inputs a bunch of calculated metrics and ratios (the features - the X variables) and feeds them into a Decision Tree algorithm (Random Forest) against a price prediction (the Y variable). It then prints out a report that shows how well the algorithm performed in general (R-squared), along with a list of features sorted by their influence on the Y variable (price).

There are numerous algorithms that can do this - the simplest being a Linear Regression model.  Decision Trees offer a faster and more efficient - and perhaps more accurate - alternative to linear regression, provided that the tree is pruned and managed correctly and that the tree doesn't get lopsided or imbalanced.
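
For reference, the core of that kind of report is only a few lines with scikit-learn (a sketch - X_train/X_test/y_train/y_test stand in for your own feature and target splits, with X kept as a DataFrame so the column names carry through):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    model = RandomForestRegressor(n_estimators=400, random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)                # X = calculated metrics/ratios, y = price target

    print(f"R-squared: {r2_score(y_test, model.predict(X_test)):.3f}")

    # Features sorted by their influence on the prediction.
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    print(importances.sort_values(ascending=False).head(20))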

But I ran into problems, especially when checking the results and data carefully. And most of the issues, were related to the data itself.

Data Alignment
I noticed that the predictive z-scores for my features didn't "line up" when I printed them twice. Turns out, this was a data alignment issue. When you are using dataframes, and making copies of these dataframes and merging them, you need to be very very careful or a column can get shifted.

This alignment issue was affecting my model because a metric that WAS a profitability metric was now being assigned to a solvency metric. Now that I have this fixed, things look much more sensible. But making sure your dataframes are aligned is a hard-learned lesson.
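
One pattern that helps avoid this class of bug is to always merge on explicit keys and let pandas complain if something is off (a sketch - the frame names and key columns are placeholders):

    import pandas as pd

    merged = profitability_df.merge(
        solvency_df,
        on=["symbol", "fiscal_date"],     # explicit join keys - never rely on row order
        how="inner",
        validate="one_to_one",            # raise immediately if the keys are duplicated
    )

    # Cheap sanity check after any merge/copy: spot-print a known symbol/date
    # and confirm each metric still carries the value you expect.
    print(merged.loc[merged["symbol"] == "PARA"].head())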

Outliers
Other issues I ran into today had to do with the fact that when I printed a report out (a weighted scoring report), certain values were far and away better than others. I didn't understand this, and discussed it with the AI I am using as a consultant, who suggested I print out z-scores.

Well, if we look below, we have an evToEBITDA z-score of 10.392 (an insane value) on 2023 Paramount reporting data.

=== Z-scores for PARA on 2023-12-31 ===
Z-scores for PARA on 2023-12-31 in pillar 'Profitability':
  grossProfitMargin: -0.263
  operatingProfitMargin: 0.038
  netProfitMargin: 0.029
  returnOnAssets: -0.033
  returnOnEquity: -0.006
  returnOnCapitalEmployed: -0.089
  returnOnTangibleAssets: 0.004
  earningsYield: 0.008
  freeCashFlowYield: 0.000
  nopat_to_totalAssets: -0.170
  operatingReturnOnAssets: -0.215
  returnOnInvestedCapital: -0.031
  ebitda_to_totalAssets: -0.384
  operatingCashFlowToSales: 0.036
  evToSales: -0.044
  evToOperatingCashFlow: 0.054
  evToEBITDA: 10.392
  evToFreeCashFlow: 0.039
 
I audited the metrics and statements, and indeed this is correct - based on what Yahoo was returning to me on the income statement for that year (Normalized EBITDA). The unnormalized EBITDA was better, but in most cases, analysts use the Normalized value. You can't do one-offs in your code for things like this, so what do you do?

I couldn't drop the row, because I was already dropping so many 2020 rows of bad data (due to Covid I suspect). I drop rows that are missing >35% of metrics. When you get a row that has all of the values you need, you tend to want to use it. I don't have code that drops rows that don't have specific dealbreaker metrics - maybe I should, but there are so many metrics that generally I figure I can score and rank even if I am missing one here or there, even a fairly well-known or important one. 

So - what do you do?

Winsorization. In other words, capping. It might make sense to invest the effort in Winsorizing all of the metrics and ratios. But for now, I am only doing it on these EBITDA ones.
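
For what it's worth, the capping itself is only a couple of lines in pandas (a sketch - the 1st/99th percentile bounds and exactly which columns to cap are judgment calls):

    def winsorize_column(df, col, lower=0.01, upper=0.99):
        # Cap the column at its own lower/upper quantiles instead of dropping the row.
        lo, hi = df[col].quantile([lower, upper])
        df[col] = df[col].clip(lower=lo, upper=hi)
        return df

    combined_data = winsorize_column(combined_data, "evToEBITDA")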

Monday, May 19, 2025

AI / ML - It's All About the Data. Imputation and Clustering Algorithms

In spare time, I have been working on a Fintech project, which is done in conjunction with a thick book I have been reading called Machine Learning for Algorithmic Trading by Stefan Jansen.

I am mostly finished with this book, and have coded - from scratch - my own implementations of the concepts it introduces.

What have I learned thus far?

It is ALL ABOUT THE DATA. Most of my time has been spent scrutinizing the data: disqualifying data, throwing away or imputing data that has no values, and Winsorizing/capping data values so that they don't skew into outliers.

Dates. Dates have always been a problem - dropping timestamps off of dates properly so that date comparisons and date math work correctly.
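
In pandas, that boils down to stripping the time component as soon as dates come in - something like this (a sketch; "reportDate" is a placeholder column name):

    import pandas as pd

    # Normalize to midnight so equality checks and date math behave consistently.
    df["reportDate"] = pd.to_datetime(df["reportDate"]).dt.normalize()
    # Or, if only calendar days ever matter, keep plain date objects:
    # df["reportDate"] = pd.to_datetime(df["reportDate"]).dt.date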

So far, a lot of what I have done is data clustering, using algorithms like DBSCAN, K-Means, Agglomerative, etc. to find useful cluster patterns, plus regression techniques to find correlations. The models and scoring so far are my own "secret sauce" Deterministic models. But I do plan to snap in some AI to do automatic weight adjustment soon.
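
For context, the clustering pass itself is straightforward with scikit-learn once the features are scaled (a sketch - X is a placeholder for the feature matrix, and the cluster counts and DBSCAN parameters are things you have to tune for your own data):

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

    X_scaled = StandardScaler().fit_transform(X)   # distance-based methods need scaled features

    kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_scaled)
    dbscan_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X_scaled)
    agglo_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X_scaled)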

Right now, I am using my own Deterministic scoring model - so it can be used as a comparative baseline. But eventually I will enhance this to be more dynamic through self-learning.  

I Need More Financial Quant Data - Techniques On How To Get It

I may have posted earlier about how finding enough data - for free - is extreeeemely difficult. Even if you can find it, ensuring the integr...