Friday, October 20, 2023

Deep Q Learning - Neural Networks - Training the Model Takes Resources

I now am starting to see why those companies with deep pockets have an unfair advantage in the not-so-level playing field of adopting AI.  Resources.  

It takes a LOT of energy and computing resources to train these Artificial Intelligence models.

In Chapter 11 of AI Crash Course (by Hadelin de Ponteves), I did the work. I downloaded, inspected, and ran the examples, which are based on Google's Deep Mind project. The idea is to use an AI to control server temperature, and compare this with an "internal" (no AI) temperature manager.

What you would do, is to train the model (first), and it would produce a model.h5 file, that would then be used when you ran the actual model through testing. 

The problem, though, is that on my rather powerful Mac Pro laptop, the training would never run. I would return HOURS later, only to see [ killed ] on the command prompt. The OS apparently was running out of resources (memory probably).

So I started tinkering with the code.

First, I reduced the number of epochs (from 25 to 10). 

#number_epochs = 25  

number_epochs = 10

Which looked like it helped, but ultimately didn't work.

Then, I reduced the number of times the training loops would run. When I looked at the original code, the number of iterations was enormous.

# STARTING THE LOOP OVER ALL THE TIMESTEPS (1 Timestep = 1 Minute) IN ONE EPOCH

while ((not game_over) and timestep <= 5 * 30 * 24 * 60):

This is 216,000 loop iterations in the inner loop, and of course this needs to be considered from the context of the outer loop (25, or, adjusted down to 10 as I did).  So 216,000 * 25 = 5 million, 400 thousand. If we reduce to 10 the number of Epochs, we are dealing with 2 million, 600 thousand.

I don't know how much memory (Heap) is used over that many iterations but on a consumer machine, you are probably going to tax it pretty hard (remember it has to run the OS and whatever tasks happen to be running on it).

I was FINALLY able to get this to run by reducing the number of Epochs to 10, and reducing the steps to 5 * 30 * 24 (3600). And even with this drastic reduction, you could see the benefits the AI performed over the non-AI temperature control mechanism.

No comments:

Fixing Clustering and Disk Issues on an N+1 Morpheus CMP Cluster

I had performed an upgrade on Morpheus which I thought was fairly successful. I had some issues doing this upgrade on CentOS 7 because it wa...