I got into some trouble with my model where it blew up to 150-180 features. When I finally took the time to really scrutinize things, I noticed that I had duplicates of raw metrics sitting alongside their sector-z-scored counterparts. I don't know how that happened, or how those sector-z-scored components got into the feature set submitted to XGBoost and SHAP. Probably logic inserted in the wrong place.
I wound up removing all of the sector-z-scored metrics for now.
But this highlighted a problem. Semantics.
I had some metrics, mostly raw values, that needed to be normalized for scoring and comparison purposes; to do this, we divided the value by TotalAssets. For metrics that we did NOT want to normalize this way, we had exclusion logic based on regular expressions (regex). We looked for metric names containing "Per" and "To" (among others), since those are already ratios rather than raw dollar amounts.
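As a rough illustration of that approach, here is a minimal sketch in pandas. The column names, the helper name, and the exact exclusion tokens are assumptions for illustration, not the actual pipeline code:

```python
import re
import pandas as pd

# Metric names containing these tokens are assumed to already be ratios
# (e.g. "EarningsPerShare", "DebtToEquity") and are skipped during scaling.
# The real exclusion list had more tokens than shown here.
EXCLUDE_PATTERN = re.compile(r"(Per|To)")

def normalize_by_total_assets(df: pd.DataFrame) -> pd.DataFrame:
    """Divide raw metric columns by TotalAssets, leaving ratio-like metrics alone."""
    out = df.copy()
    for col in df.columns:
        if col == "TotalAssets":
            continue  # don't normalize the denominator itself
        if EXCLUDE_PATTERN.search(col):
            continue  # already scale-independent; leave as-is
        out[col] = df[col] / df["TotalAssets"]
    return out
```

The main risk with name-based exclusion like this is semantic, exactly the problem above: a metric whose name happens to contain "To" or "Per" slips through (or gets caught) for the wrong reason.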
This seems to have fixed our feature set, and it is so much better to see 30 of 80 features selected instead of 100 of 180. It removed a ton of noise from the model, improving its integrity.
Now I do need to go back and examine why we did the sector z-scores in the first place, to see whether that is something we need to engineer back in. I think we will need it in the cases where we are producing a Top-X-By-Sector report.
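For that use case, a minimal sketch of sector-relative scoring might look like the following. The column names (Sector) and helper names are assumptions, not the original implementation:

```python
import pandas as pd

def sector_z_score(df: pd.DataFrame, metric: str, sector_col: str = "Sector") -> pd.Series:
    """Z-score a metric within each sector so companies are compared to their peers."""
    grouped = df.groupby(sector_col)[metric]
    return (df[metric] - grouped.transform("mean")) / grouped.transform("std")

def top_x_by_sector(df: pd.DataFrame, metric: str, x: int = 10,
                    sector_col: str = "Sector") -> pd.DataFrame:
    """Rank on the sector-relative z-score and keep the top X names per sector."""
    ranked = df.assign(_z=sector_z_score(df, metric, sector_col))
    return (ranked.sort_values("_z", ascending=False)
                  .groupby(sector_col)
                  .head(x))
```

If it goes back in, keeping this logic in the reporting layer rather than in the feature set itself would avoid the original problem of sector-z-scored duplicates leaking into what XGBoost and SHAP see.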