Showing posts with label Open Source. Show all posts

Friday, September 5, 2025

I Need More Financial Quant Data - Techniques On How To Get It

I may have posted earlier about how finding enough data - for free - is extreeeemely difficult.

Even if you can find it, ensuring its integrity can cost so much time, money, and cycles that it becomes much simpler to just let someone else deal with it and subscribe. Problem is, I am "subscriptioned out". I can't keep adding layer upon layer of subscriptions, because that money adds up.

So - I work hard to see what data is available out there (e.g. on Kaggle). It makes no sense to waste processing cycles and bandwidth if someone has already cultivated that data and is willing to share it.

I have also learned that there are a lot of bots out there that screen-scrape, using tools like Beautiful Soup. And if you are clever enough to use layers of (secure - I can't stress that enough) proxies, and morph your digital fingerprint (e.g. by changing up browser headers and such), you can go out there, find data, and save it - and even check the integrity of the data by comparing it against two or three sources.

And don't forget rate-limiting and Cloudflare tools - you have to figure out how to evade those as well. It's a chess game, and one that seemingly never ends.
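The header-rotation, backoff, and cross-checking ideas above can be sketched in a few stdlib-only functions. This is a minimal illustration, not any particular bot's code - the header strings and tolerance are assumptions, and no actual target URL is included:

```python
import random
import time
import urllib.error
import urllib.request

# A small pool of realistic browser User-Agent strings; rotating these
# "morphs" the client fingerprint between requests. (Strings are illustrative.)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/128.0",
]

def rotated_headers():
    """Pick a fresh set of request headers for each fetch."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

def fetch_with_backoff(url, max_tries=4, base_delay=2.0):
    """Fetch a URL, backing off exponentially when rate-limited (429/503)."""
    for attempt in range(max_tries):
        req = urllib.request.Request(url, headers=rotated_headers())
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in (429, 503):  # not a rate-limit response
                raise
            time.sleep(base_delay * (2 ** attempt))  # polite exponential backoff
    raise RuntimeError(f"gave up on {url} after {max_tries} tries")

def cross_check(values, rel_tol=0.01):
    """Integrity check: do the values from two or three sources agree
    to within a relative tolerance?"""
    lo, hi = min(values), max(values)
    return hi == 0 or (hi - lo) / abs(hi) <= rel_tol
```

In practice you would route `fetch_with_backoff` through your proxy layer as well; that part is provider-specific and omitted here.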

Anyway - I decided I needed quarterly data. My XGBoost model just wasn't computing the way I wanted. I added more interactive features from macro data, and even a "graph score" (see earlier posts). And indeed, the score - the R-squared score - came up, but it didn't get to where I wanted it, and the list of stock picks was not made up of stocks I would personally invest in.
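For anyone unfamiliar, the R-squared score mentioned above is just the fraction of variance in the target that the model's predictions explain. A minimal stdlib computation (my model's actual data is not shown here):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    1.0 means perfect predictions; 0.0 means no better than
    predicting the mean of y_true every time."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot
```

So a rising R-squared means the model explains more variance - but, as noted above, that says nothing about whether the resulting picks are stocks you would actually want to own.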

I decided to do two things:

  1. Find superior data source(s) - preferably where I could get more and better quarterly data - for free.
  2. Consolidate the code so that I didn't have to manage and sync code that was fetching on one frequency (annual) vs another.

I underestimated these tasks. Greatly.

I found a GitHub project that could hit different data sources. It had an OO design - probably over-engineered, IMHO - but I got what the author was after: by using a base class and plugging in different "interface classes", you could switch back and forth between different data sources.
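The base-class-plus-pluggable-providers idea looks roughly like this. To be clear, the class and method names below are hypothetical, not the actual repo's API - this is just a sketch of the pattern:

```python
from abc import ABC, abstractmethod

class StatementSource(ABC):
    """Base class; each concrete subclass wraps one data provider."""

    @abstractmethod
    def fetch_statements(self, symbol, period):
        """period is 'annual' or 'quarterly'; returns a list of dicts."""

class ProviderA(StatementSource):
    def fetch_statements(self, symbol, period):
        # real code would call provider A's API here
        return [{"symbol": symbol, "period": period, "provider": "A"}]

class ProviderB(StatementSource):
    def fetch_statements(self, symbol, period):
        # real code would call provider B's API here
        return [{"symbol": symbol, "period": period, "provider": "B"}]

def download(source, symbols, period="quarterly"):
    """Caller code depends only on the StatementSource base class,
    so swapping providers is a one-line change at the call site."""
    return {s: source.fetch_statements(s, period) for s in symbols}
```

The payoff is exactly what the author intended: `download(ProviderA(), syms)` and `download(ProviderB(), syms)` are interchangeable.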

So I tried it. And, lo and behold, it didn't work. At first it did - for annual statements. But after I downloaded about 8,000 quarterly statements, I was horrified to realize that all of the quarterly statements were clones of the annual statements. Wow, what a waste!

I checked - and the quarterly data was indeed there at the source. The GitHub code was flawed. So I fixed it - and even enhanced it.

This is the first time I have actually contributed to a community GitHub project. I am familiar with Git and GitHub, but if you are not doing this kind of thing on the regular, you have to re-learn topics such as branch development, pull requests, merges, etc. And perhaps one of the most annoying things is that the upstream owner of the repository may not like or agree with your changes.

In this particular case, the repo owner was using property decorators. Those work fine for parameterless accessors, but a property cannot take arguments - once the underlying calls needed parameters, the property-based interface broke down, and I had to blow those out. He didn't seem happy about it, but eventually he seemed to acknowledge the need. Another difference of opinion had to do with his use of an lru_cache decorator on the fetch calls. I wasn't up to speed on this, so I read up on it and concluded that this was NOT the right situation for caching, let alone LRU caching. Caching can speed things up TREMENDOUSLY in the right use cases - but if you are batch downloading thousands of statements for thousands of symbols, you never ask for the same symbol twice, so the cache is never hit; it just creates overhead - and risk (e.g. running out of memory if you don't set a max size on the cache).
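Both points above can be demonstrated in a few lines. Again, these names are illustrative, not the repo's actual code:

```python
from functools import lru_cache

class Statements:
    def __init__(self, symbol):
        self.symbol = symbol

    @property
    def annual(self):
        # Fine: a property takes no arguments beyond self,
        # and is read as a plain attribute: obj.annual
        return f"{self.symbol}: annual"

    # A property CANNOT accept a parameter. If `filings` were a property,
    # obj.filings("quarterly") would raise TypeError, because obj.filings
    # is already the returned value, not a callable. So parameterized
    # accessors have to stay plain methods:
    def filings(self, period):
        return f"{self.symbol}: {period}"

@lru_cache(maxsize=None)  # unbounded: every distinct call is kept forever
def fetch(symbol, period):
    # real code would download the statement here
    return f"downloaded {symbol} {period}"
```

With `fetch` cached like this, a batch run over thousands of distinct symbols records a miss for every single call - `fetch.cache_info()` shows misses growing and hits staying at zero - while the unbounded cache quietly accumulates every result in memory. That is the overhead-and-risk scenario described above.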

In the end, I have some code that works. I had to rebase and update the pull request, and if he doesn't take these changes the way I wrote them and need them, I guess I can always create my own repo and go solo on this. I would rather not, because the repo owner keeps his repository in sync with the pip installer, which makes it easy to download and update.

  

Thursday, July 25, 2019

ONAP - Just Too Big and Heavy?

I have been working with Service Orchestrators for a while now. Here are three of them I have had experience with:

  • Heat - an OpenStack project. While OpenStack itself can be considered the VIM (Virtual Infrastructure Manager), Heat is an orchestrator that runs on top of OpenStack and lets you deploy and manage services.
  • Open Baton - this was the ORIGINAL reference implementation of the ETSI MANO standards, out of a think tank in Germany (Fraunhofer FOKUS).
  • ADVA Ensemble - an ETSI-based orchestrator that is not in the public domain; it is the IPR of ADVA Optical Networking, based in Germany.
There are a few new Open Source initiatives that have surpassed Open Baton for sure, and probably Heat also. Here are a few of the more popular open source ones:
  • ONAP - a Tier 1 carrier solution, backed by the likes of AT&T.
  • OSM - I have not examined this one fully. TODO: Update this entry when I do.
  • Cloudify - a commercial implementation that bills itself as being more lightweight than ONAP.
I looked at ONAP today. Some initial YouTube presentations were completely inadequate for letting me "get started". One was a presentation by an AT&T Vice President. Another was done by some architect who didn't show a single slide on the video (the camera was trained on the speaker the whole time).

This led me to do some digging around. I found THIS site: Setting up ONAP

Well, if you scroll down to the bottom of this, here is your "footprint" - meaning, your System Requirements, to install this.

ONAP System Requirements
Okay. This is for a full installation, I guess. The 3 TB of disk is not that bad - you can put a NAS out there and achieve that, no problem. But 148 vCPUs???? THREE HUNDRED THIRTY-SIX gig of RAM? OMG - that is a deal killer in terms of being able to install this in a lab here.

I can go through and see if I can pare this down, but I have a feeling that I cannot install ONAP. This is a toy for big boys, who have huge servers and lots of dinero.

I might have to go over and look at OSM to see if that is more my size.

I will say that the underlying technologies include Ubuntu, OpenStack, Docker and MySQL - which are pretty mainstream.
