Tuesday, June 3, 2025

AI / ML - Fetching Data, Quality Control, Optimization and Review

Most of my time lately has been "refining" the model. 

For example, one of the things you need to really think about doing AI is where your data is coming from, and the quality of that data - and the price of that data. 

Originally, I was using FMP for data. But the unpaid version only gives you access to 100 symbols. You cannot get far with 100 symbols, even if you collect scores of metrics and ratios on them. So when you build your initial model on FMP, say using the TTM API on 100 symbols, you will need to consider an ante-up for more symbols, or look for symbols and data elsewhere.

I have considered writing an intelligent bot to "scour" the universe of financial sites to pull in data on symbols. There are a lot more of these where you can get current data (i.e. stock price), but when it comes to statements, you are going to need to hit the SEC itself, or an intermediary or broker. If you hit these sites without reading their disclosures, you can get banned. 

At a minimum, there is the rate limit. It is critical to understand how to rate limit your fetches. So using a scheduler, and running in batches (if they're supported) can really help you.  Another thing is intelligent caching. It makes no sense to get into hot water fetching the same statement for the same symbol you just fetched an hour ago. Once you have a complete statement, you probably want to keep it in your pocket, and only update on a lower frequency if you decide to update old data at all.

So most of my time lately has been spot checking the data, building some caching and trying to do some general improvement on the processing and flow.

I found a couple of nice python tools one can use to load csv files: tabview, and vizidata. The latter is a bit more robust. Having a csv viewer is a game changer if you want to stay in a terminal and not "point and click".

With a tool like this, you can really start to backtrack into missing holes of data. I had one metric for example that was a typo (single letter) and had NO data for that metric. I had other issues with division by zero errors, Panda Dataframe vs Series issues, etc. 

You also have to pay attention to these python algorithms, and what they spit out. The stuff may look like intimidating jibberish, but it's there for a reason and taking the time to really examine it can pay off quite a bit. For example, I decided to exclude certain metrics because they had circular influence. And when you make a change like that, the feature influences can change drastically. 

 

No comments:

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts earlier on, and somehow not until just now - this late in the game - did they come up as I was re...