Monday, September 22, 2025

I Have More Data Now - Enough for an AI RNN LSTM Model?

I have a LOT more data now than I did before. And an advanced architecture to process it.

Should I consider an RNN?

I knew I couldn't really pull it off with the Annual data I had, because by the time you split the data into training, validation, and test sets, there isn't enough left to feed the algorithm.

But - Quarterly! I have a LOT of quarterly data now, many statements per symbol across quarter-dates. ~70K rows of data!!!

So let's try an LSTM. I wrote a standalone LSTM model using Keras - just a few lines of code.

One important note about this! 

Do NOT mix your data processing and/or XGBoost code with neural network code! ALWAYS create a brand new virtual environment for your RNN code, because Keras (and the other competing frameworks) will require specific versions of Python libraries that can conflict with your data processing and/or XGBoost libraries.
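For what it's worth, setting that up is only a couple of commands. Something along these lines (the environment name and package list here are just examples, not my exact setup):

python -m venv lstm-env
source lstm-env/bin/activate      (on Windows: lstm-env\Scripts\activate)
pip install tensorflow scikit-learn numpy pandas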

Now, with that important disclaimer out of the way, here is a small sample of the code. I'll highlight it in blue, since my blog tool apparently has no code block format.

# -------------------------
# Imports
# -------------------------
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

# X (shape: samples x SEQ_LEN x features), y, feature_cols, SEQ_LEN,
# TEST_SIZE and RANDOM_STATE are defined earlier in the script (not shown here).

# -------------------------
# Train/test split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# -------------------------
# Build LSTM model
# -------------------------
model = Sequential()
model.add(LSTM(32, input_shape=(SEQ_LEN, len(feature_cols)), return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))  # regression output

model.compile(optimizer='adam', loss='mse')

# Early stopping
es = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# -------------------------
# Train
# -------------------------
history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=16,
    callbacks=[es],
    verbose=1
)

# -------------------------
# Evaluate
# -------------------------
y_pred = model.predict(X_test).flatten()
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R²: {r2:.3f}, MAE: {mae:.3f}")
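One thing the sample above glosses over: the LSTM expects X to be a 3D array of shape (samples, SEQ_LEN, features), so each symbol's quarterly rows have to be windowed into fixed-length sequences first. That step isn't shown here, and I'm not claiming this is exactly how my script does it, but a minimal sketch could look like the following (the file name, the column names 'symbol', 'report_date', 'target', and SEQ_LEN = 4 are placeholder assumptions):

import numpy as np
import pandas as pd

SEQ_LEN = 4  # e.g. four consecutive quarters per training sample

# Load the quarterly statements (file and column names are placeholders)
df = pd.read_csv("quarterly_statements.csv")
feature_cols = [c for c in df.columns if c not in ("symbol", "report_date", "target")]

sequences, targets = [], []
for _, group in df.sort_values("report_date").groupby("symbol"):
    values = group[feature_cols].to_numpy()
    labels = group["target"].to_numpy()
    # Slide a SEQ_LEN-quarter window over each symbol's history;
    # the label is the target for the quarter that follows the window.
    for i in range(len(group) - SEQ_LEN):
        sequences.append(values[i:i + SEQ_LEN])
        targets.append(labels[i + SEQ_LEN])

X = np.array(sequences)  # shape: (num samples, SEQ_LEN, num features)
y = np.array(targets)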

Well, how did it go?


The previous XGBoost r-squared value was consistently .11-.12. Now we are getting .17-.19. That is a noticeable improvement!

