Reinforcement Learning & Risk of Over-Optimization

Reinforcement learning can be a powerful tool for trading when it works well. It also comes with several important cautions, the most prominent being the risk of over-optimization. Let’s look at what optimization is and how to avoid the pitfalls of over-optimization.
Apr 12, 2025

Optimization and Over-optimization

Optimization is the process of maximizing (or minimizing) an objective function subject to a set of constraints.

For example, imagine designing a sturdy yet inexpensive cup. In this case, if you set a constraint that the cup’s strength must exceed a certain level, variables such as the cup’s size, thickness, and material become adjustable parameters. Optimally designing the cup means carefully tweaking these variables so that, within the constraints, the cost of the cup is minimized.

Now, what happens if you focus solely on reducing cost? If you opt for a material that is cheap but fragile, you might need to compensate by making the cup’s thickness abnormally high to meet the strength requirement. After several rounds of trial and error, you would eventually figure out how to use the right amount of suitable material to produce an optimal cup—one that is sufficiently sturdy yet reasonably priced.

While you could arrive at an optimal solution through empirical trial and error, modern approaches are rooted in mathematics. Once the constraints and the objective function are defined mathematically, simple cases can sometimes be solved directly, while harder ones are approximated using calculus and numerical algorithms.

In scenarios where there are too many variables for a human to analyze directly, optimization can be particularly effective. However, it is crucial to fully understand the input variables, output values, and the overall structure of the process when employing optimization techniques. Otherwise, you may encounter extreme and unrealistic outcomes—much like a cup designed with an abnormally thick structure using an overly weak material. This phenomenon is known as over-optimization.
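The cup example can be written down as a small constrained optimization. The sketch below is illustrative only: the material names, costs, strengths, and the `max_thickness_mm` limit are all made-up assumptions, and the model (cost and strength both proportional to wall thickness) is deliberately simplistic. It shows how a missing constraint lets the optimizer settle on the thick, fragile design described above.

```python
# Hypothetical materials: (cost per mm of wall, strength per mm of wall).
MATERIALS = {
    "cheap_fragile": (0.5, 2.0),
    "mid_range": (2.0, 6.0),
    "premium": (5.0, 12.0),
}
REQUIRED_STRENGTH = 12.0  # assumed strength constraint, arbitrary units


def design_cup(materials, required_strength, max_thickness_mm=None):
    """Return (cost, material, thickness_mm) of the cheapest feasible design."""
    best = None
    for name, (cost_per_mm, strength_per_mm) in materials.items():
        # Thinnest wall that still satisfies the strength constraint.
        thickness = required_strength / strength_per_mm
        if max_thickness_mm is not None and thickness > max_thickness_mm:
            continue  # violates the (optional) thickness constraint
        cost = cost_per_mm * thickness
        if best is None or cost < best[0]:
            best = (cost, name, thickness)
    return best


# Cost-only view: the optimizer happily picks a very thick, fragile cup.
print(design_cup(MATERIALS, REQUIRED_STRENGTH))
# Adding a realistic thickness limit changes the answer.
print(design_cup(MATERIALS, REQUIRED_STRENGTH, max_thickness_mm=3.0))
```

With these hypothetical numbers, the unconstrained search picks the cheap, fragile material with a 6 mm wall, while adding a 3 mm thickness limit moves the answer to the mid-range material. The optimizer did nothing wrong in the first case; the problem statement was simply missing a constraint.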

How does this apply to trading? Optimization can be employed in various areas, whether it’s in determining the allocation ratios in a portfolio or fine-tuning parameters to decide the timing of trades in algorithmic trading. In such cases, it is essential to have a deep understanding of the input variables, output results, and the process and structure involved. Without this understanding, one might easily fall into the trap of over-optimization.

What is ‘Reinforcement Learning’?

[Figure: Structure of Reinforcement Learning (environment, agent, state, action, reward)]

Reinforcement Learning (RL) is an artificial intelligence technique where an agent learns autonomously through its interactions with the environment, ultimately discovering the best possible actions. The core elements of reinforcement learning are the state, action, and reward. The agent learns a policy aimed at maximizing rewards by engaging in trial and error. By adjusting future actions based on past outcomes, reinforcement learning enables progressively better decision-making over time.
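To make the state, action, and reward loop concrete, here is a minimal sketch using tabular Q-learning, one common RL algorithm. The article does not say which algorithm was used in the experiment, so treat this purely as an illustration. The toy "market" (a price that alternates up and down), the function names, and all hyperparameters are assumptions.

```python
import random

ACTIONS = (-1, +1)  # -1 = short, +1 = long


def train_q_agent(moves, episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a sequence of +1/-1 price moves.

    State  : sign of the previous move.
    Action : go short (-1) or long (+1) for the next bar.
    Reward : action * next move (profit in price units, no fees).
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (-1, +1) for a in ACTIONS}
    for _ in range(episodes):
        for t in range(1, len(moves)):
            state = moves[t - 1]
            # Epsilon-greedy: mostly exploit, sometimes explore (trial and error).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            reward = action * moves[t]
            next_state = moves[t]
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q


def greedy_policy(q):
    """Best known action for each state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (-1, +1)}


moves = [+1, -1] * 50  # toy series: every up move is followed by a down move
q_table = train_q_agent(moves)
print(greedy_policy(q_table))
```

On this toy series the agent should learn to go short after an up move and long after a down move. Note that the learned rule is only as good as the pattern it was trained on: if the series stopped alternating, the same policy would lose money.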

Unlike approaches that seek a precise mathematical solution, reinforcement learning discovers optimal actions through experience. This makes it particularly effective in complex and probabilistic environments like trading, where it can learn and adapt to subtle market patterns and evolving conditions—challenges that traditional optimization methods often struggle to capture.

However, the flexibility of reinforcement learning can also increase the risk of overfitting, where the model becomes too finely tuned to past data.

Reinforcement Learning as an Overfitting Machine – Simulation Experiment

Let’s conduct a simulation experiment to see what happens when reinforcement learning is pushed to the extreme of over-optimization. We trained a reinforcement learning agent for both LONG and SHORT trades using BTC/USDT 15-minute candlestick data spanning from August 2023 to January 2025—a period of one and a half years.

Results of Training Reinforcement Learning Model

The figure below shows a backtest in which the trained model was applied to the same data it was trained on. Over the 1.5-year period, the account grew to tens of thousands of times the initial capital. Even allowing for the fact that the model was evaluated on its own training data, how could such a curve possibly emerge?

[Figure: In-Sample Result]

Cause 1 – Error in Constraints

Although trading fees were taken into account, other common constraints inherent in trading were not considered. The reinforcement learning agent was trained under the unrealistic assumption that every trade—regardless of its size—executes at the current price with a 100% fill rate, and the backtest simulation was carried out under this same assumption.
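One way to relax the "every order fills at the current price" assumption is to make the simulator charge for order size. The sketch below is a toy fill model, not the simulator used in the experiment: the function name, the linear price-impact rule, and the `fee_rate`, `max_participation`, and `impact_coeff` defaults are all hypothetical.

```python
def simulate_fill(side, order_qty, bar_price, bar_volume,
                  fee_rate=0.0005, max_participation=0.1, impact_coeff=0.1):
    """Return (filled_qty, avg_price, fee) for a market order under a toy model.

    side: +1 for a buy, -1 for a sell.
    """
    # Partial fills: never take more than a fraction of the bar's volume.
    filled_qty = min(order_qty, max_participation * bar_volume)
    if filled_qty <= 0:
        return 0.0, bar_price, 0.0
    # Linear price impact: the larger our share of the volume,
    # the further the average fill price moves against us.
    participation = filled_qty / bar_volume
    avg_price = bar_price * (1.0 + side * impact_coeff * participation)
    fee = fee_rate * filled_qty * avg_price
    return filled_qty, avg_price, fee


# A buy order for 50 units in a bar that only traded 100 units.
print(simulate_fill(+1, 50.0, 100.0, 100.0))
```

Under this model a large order is only partially filled and pays a worse price, so the agent can no longer scale its position without limit as its simulated capital grows.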

When returns earned under these flawed assumptions compound over time, the backtest produces an abnormal return curve. This is an example of over-optimization caused by improperly set constraints. It is, however, an issue that can be improved gradually by refining the simulator. Could there be any other issues?
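A quick back-of-the-envelope calculation shows why compounding matters here. The sketch below assumes roughly 548 days of 15-minute candles and a final multiple of 10,000, the low end of "tens of thousands of times". Both numbers are illustrative assumptions rather than figures from the experiment, and the calculation pretends the edge is spread evenly across every bar.

```python
import math

BARS_PER_DAY = 24 * 60 // 15  # 96 fifteen-minute candles per day


def required_edge_per_bar(total_multiple, days):
    """Per-bar growth rate that compounds to `total_multiple` over `days`."""
    bars = days * BARS_PER_DAY
    # Solve (1 + edge) ** bars == total_multiple for edge.
    return math.expm1(math.log(total_multiple) / bars)


# Assumed inputs: ~1.5 years of data and a 10,000x final multiple.
edge = required_edge_per_bar(10_000, 548)
print(f"{edge:.4%} per 15-minute bar")
```

Under these assumptions the required edge comes out at roughly 0.02% per bar. An advantage that small could plausibly be erased by slippage or partial fills, which is why a simulator that ignores them can produce such an extreme curve.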

Cause 2 – Overfitting to Historical Data

To set aside the distortions caused by the erroneous constraints, let’s look at short periods, where those distortions have not had time to compound, and examine how the reinforcement learning agent responds to market price fluctuations. For trading in real markets, it is especially important to verify how the agent behaves on new data that was not used during training.

First, we’ll apply the trained model to the four days of data immediately following the training period.

[Figure: Out-of-Sample Result (1)]

During the first four days, the model returned approximately 13%. With numbers like this, even if fixing the constraint issues reduces the returns somewhat, wouldn’t we still expect pretty good profits?

Let’s take a look at the following four days.

[Figure: Out-of-Sample Result (2)]

This time, we observed a loss of around 2%. Should we be grateful that it wasn’t a huge loss?

Next, let’s review another four-day period.

[Figure: Out-of-Sample Result (3)]

In this interval, the profit was just under 1%. Although the model made tens of thousands of times the return over 1.5 years in the historical training data, it appears to struggle to generate profits on brand-new data.

Now, let’s run a backtest over a period of about one month that wasn’t used during training.

[Figure: Out-of-Sample Result (4)]

After the initial four days with significant gains, the market seemed to move sideways—perhaps due to low volatility—and then losses began to accumulate again once volatility returned.

Summary of Experimental Results

Within the historical data used for training, it’s possible to develop reinforcement learning trading agents that show unbelievable returns.

However, on brand-new data the agent may end up with either a profit or a loss. No matter how consistently high the returns were on the training dataset, the fact that losses occur on unseen data indicates that the reinforcement learning agent may have been overfitted to past data. Resolving overfitting is a crucial challenge if we are to deploy reinforcement learning agents in live trading.
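The evaluation procedure used above, consecutive four-day windows immediately after the training period, can be sketched as follows. The helper names, the `train_end` index, and the position encoding are hypothetical, and the return calculation ignores fees and slippage for brevity.

```python
BARS_PER_DAY = 96  # 15-minute candles


def out_of_sample_windows(n_bars, train_end, window_days=4, n_windows=3):
    """Yield (start, stop) index pairs for consecutive test windows after training."""
    window = window_days * BARS_PER_DAY
    start = train_end
    for _ in range(n_windows):
        stop = start + window
        if stop > n_bars:
            break  # not enough unseen data left for a full window
        yield start, stop
        start = stop


def total_return(closes, positions):
    """Compounded return of holding positions[t] (-1, 0 or +1) from bar t to t+1."""
    equity = 1.0
    for t in range(len(closes) - 1):
        bar_return = closes[t + 1] / closes[t] - 1.0
        equity *= 1.0 + positions[t] * bar_return
    return equity - 1.0


# Example: 2,000 bars in total, the first 800 of which were used for training.
for start, stop in out_of_sample_windows(2000, 800):
    print(start, stop)
```

Reporting the return for each window separately, rather than one aggregate number, makes it easier to see whether performance decays as the test data moves further away from the training period.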

On a somewhat positive note, the simulation results suggest that profits are possible immediately after the training period. The agent is capable of recognizing subtle market patterns that are almost imperceptible to human eyes, seizing trading opportunities based on these signals. While these minute patterns may change continuously, it seems that they tend to persist for a while right after the training data period.

Key Structures in Reinforcement Learning

In traditional optimization problems, a mathematically defined objective function is maximized or minimized under certain constraints. A thorough understanding of the input variables, output values, and the overall structure of the problem is critical to avoiding over-optimization.

Reinforcement learning, on the other hand, learns optimal actions through an empirical, trial-and-error approach in complex and probabilistic environments rather than relying on a clear-cut mathematical solution. So, how can overfitting be avoided in reinforcement learning? It requires a clear understanding, design, and application of its core components—state, action, and reward.

Likewise, when applying reinforcement learning to trading, it’s essential to establish realistic constraints and continuously monitor, review, and refine the reinforcement learning model and its structure in order to prevent overfitting.


Introducing QMELLION

QMELLION is focused on researching reinforcement learning-based trading algorithms. We will be sharing our related research insights and achievements on this blog.
