
Reinforcement Learning & pricing: a complicated love story

Reinforcement Learning (RL) is a recurring topic here at Tryolabs, whether we’re designing solutions internally or working directly with our clients. Particularly when evaluating options for Price Optimization problems, we’ve considered and studied its feasibility many times, under different scenarios.

I can identify at least two important reasons for that topic arising so often. First, this is an exciting field of AI, so we all want to learn more about it. And added to that, it’s quite straightforward to map the pricing problem to the Reinforcement Learning framework: set reward = profit and try to maximize that (we’ll explain this better below). And it works! You can play this simple game we developed to do precisely that, in case you’re curious.

But in real-world scenarios, in price-optimization projects for our clients, we learned that classical demand forecasting approaches provide an excellent place to begin working. In further steps, RL models can be built on top of an existing machine learning system, as will be shown with a couple of examples.

In the following lines, I’ll explain the main reasons that led us to prefer other options over RL for these cases. But since we’re very interested in it and have no intention of being labeled as its public enemy, I’ll also cover other potentially groundbreaking use cases for it. These cases are still related to pricing systems.


What’s the story with Reinforcement Learning?

If you’ve been snooping around the Machine Learning world lately, you may know that RL has been on the bleeding edge of research and development. Advancements in this area are astonishing, as we all have heard.

One of RL’s most interesting aspects is that it can learn without the need for expert domain knowledge other than the system rules, i.e., some representation of the state of the environment, with available actions in each state, and finally the rewards/penalties received when moving from a state to the next one. That’s what happens when computer scientists meet Pavlov.

Thinking machines

The link between ‘the way we believe we think’ and this field of AI is clear. Behaviorism (the natural-intelligence counterpart of RL theory) is not considered the ultimate way to explain how the mind works, but it accounts for a large part of the advancements in psychological research.

That similarity in the way people and animals learn from their environment seems to bias our common sense and make us believe that we have finally solved the problem of intelligence and that RL is now the only path to follow.

But should we all follow this particular path to solve all our Machine Learning problems? Well, intense research pushes this field forward, and from that point of view, we are in a very sweet spot where computer science meets psychology, among many other fields (just search for ‘behavioral economics’ as an example). Discoveries in one field inspire research in the other, and from that synergy, we all get excited. Computer scientists think that they should be awarded psychologist certifications. Psychologists and economists learn to program in Python and R. All that motivation is truly powerful, which is awesome, but it doesn’t mean that we should all throw away our previous tools.

Analytical tools

The great power of many of our analytical tools comes from the fact that they enable different ways to look at problems. Sometimes these tools support our intuition, and other times they completely override it. Combining tools, processing different visualizations, and leveraging the best-performing machinery in each specific domain brings us to the best results if we do it right.

We should ask ourselves, what different visualizations, components, and sub-problems will help us optimize a pricing policy? 

Focusing on solving a price optimization problem may seem shallow compared to the deep mysteries of how the mind works. But consider this if you’re starting to feel bored: prices are probably the most important signals nowadays, driving this giant worldwide network of human exchange and collaboration that makes up the global economy.

So, if you feel more interested in understanding how pricing works, let’s get into it.

What’s the story with pricing?

Returning to our reasoning about the various ways to visualize problems, and the power of combining tools and solving sub-problems, we get to one of the central points we want to argue here:

A great pricing system is first a good forecasting system.

Depending on where you come from, you might find this statement either obvious or unacceptable. In some cases, forecasting may be so inaccurate that it’s not a feasible basis for a pricing system, and thus a different approach is needed. That is the reason for the “great” qualifier in the previous statement. The point is: when good forecasting is available, the best pricing system is built on it and also provides it as an output.

Let’s explain where this sort of *if and only if* condition comes from and see if we can agree on a few things.

Good forecasting yields pricing strategies

Let’s start with the less controversial part of our statement and explain why forecasting enables pricing. If you can predict how many sales you’ll have for each product as a function of the price, and you can do so with good accuracy, then you are only a high-school-level calculation away from the optimum price for this product:

profit(price) = predicted_sales(price) × (price − cost)

Go through the entire price range, get the predicted sales for each price point, calculate the profit, and then just find the point where this profit function has the maximum value.
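The sweep described above can be sketched in a few lines of Python. This is only an illustration: the demand curve in `predicted_sales` is a made-up exponential standing in for a real forecasting model, and the cost, price range, and step size are arbitrary assumptions.

```python
import math


def predicted_sales(price: float) -> float:
    """Hypothetical demand curve (an assumption, not a fitted model):
    predicted unit sales fall exponentially as the price rises."""
    return 1000.0 * math.exp(-0.05 * price)


def profit(price: float, cost: float) -> float:
    """profit(price) = predicted_sales(price) * (price - cost)."""
    return predicted_sales(price) * (price - cost)


def best_price(cost: float, low: float, high: float, step: float = 0.5) -> float:
    """Evaluate the profit at every price point on a grid and return the best one."""
    n_steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda p: profit(p, cost))


optimum = best_price(cost=10.0, low=10.0, high=100.0)
print(f"Best price on the grid: {optimum:.2f}")
```

In a real system, `predicted_sales` would be replaced by the forecasting model's prediction for the product, with the other input variables held fixed while the price varies.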

As you can see, we’re assuming that we can forecast sales for any price we end up setting, given the other variables that might have an influence on sales. So hold on, as there are many subtle things implicit behind the innocent “good forecasting” requirement, hiding and lying in wait to hit us, naive developers, with a shot of reality that could ruin our whole system.

  • Supply chain management: What if we run out of stock in the middle of our predicted sales?
  • Outlier prediction: Black Friday, yay! Not so fun if you’re trying to predict sales and didn’t think about it.
  • Product cannibalization: You have 2k products to price and can’t check them one by one, so you may not notice that you’re trying to sell a package of two units (deodorants, soap, cookies, use your imagination) for more than the price of the two separate units. You’ll very likely sell near zero.
  • Exploration/exploitation tradeoff: What if we’re always selecting prices near the historically set prices? Maybe the system gets high forecasting accuracy for the prices that it sets live. Still, we never discover that the rest of the predicted demand curve was wrong, underestimating sales for some prices that were never previously tested. If some new prices are never explored, there’s no way to tell that the selected prices are the best. This dilemma is very well known in RL theory, but we’ll have to deal with it here even if we don’t use RL.
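As a small illustration of the cannibalization point in the list above, an automated sanity check can flag multi-unit packages priced above the sum of their separate units. The data layout here (a dict of unit prices and a dict of bundles) and the function name are hypothetical, chosen only for this sketch.

```python
def find_overpriced_bundles(unit_prices, bundles):
    """Return the ids of bundles priced above the sum of their components.

    unit_prices: maps a product id to its single-unit price.
    bundles: maps a bundle id to (list of component product ids, bundle price).
    """
    flagged = []
    for bundle_id, (components, bundle_price) in bundles.items():
        separate_total = sum(unit_prices[c] for c in components)
        if bundle_price > separate_total:
            flagged.append(bundle_id)
    return flagged


# Made-up catalog: two soaps cost 4.00 separately, but the two-pack is 4.50.
unit_prices = {"soap": 2.0, "deodorant": 4.5}
bundles = {
    "soap_x2": (["soap", "soap"], 4.5),
    "deodorant_x2": (["deodorant", "deodorant"], 8.0),
}
flagged = find_overpriced_bundles(unit_prices, bundles)
print(flagged)
```

This only catches the explicit package case. Cannibalization between merely similar products needs some notion of product similarity, which is a harder problem.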

Each of the above is an entire research area, and this combination is particular to sales forecasting. As such, these issues won’t get resolved out of the box when using a generic forecasting solution.

So, “good forecasting,” huh? There’s a lot more to it than meets the eye.

Great pricing requires good forecasting first

A perfect Reinforcement Learning scenario

After all, considering the important challenges in forecasting listed above, it looks like we’re over-complicating things, trying to forecast sales first only to calculate the optimum price later.

That’s probably the main concern for those who come from RL and found our previous statement (about needing forecasting to solve the pricing problem) “unacceptable”. Because they will say:

“I just want to maximize revenue or profit, and I will set that as my reward function. That’s all I need for RL. I may even solve this using a simple multi-armed bandit.” - Hypothetical advocate for RL

Well, that might be a lot to infer about what someone would say, but I’m sure that the first argument would be along those lines. And it’s a valid argument. There’s no fallacy in that statement, except for the last part: if you read the challenges above, you can’t think of a simple multi-armed bandit as anything other than a baseline or a warm-up to start thinking about the problem.
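To make the baseline concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit where each arm is a candidate price and the reward is the profit from one simulated customer. Everything about the environment is an assumption made up for the example: the linear purchase probability in `simulate_profit`, the cost, and the candidate prices.

```python
import random


def simulate_profit(price, cost, rng):
    """Hypothetical environment: a single customer buys with a probability
    that decreases linearly with price. The reward is the profit of that sale."""
    buy_probability = max(0.0, 1.0 - price / 50.0)
    return (price - cost) if rng.random() < buy_probability else 0.0


def epsilon_greedy_pricing(prices, cost, rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over a fixed list of candidate prices."""
    rng = random.Random(seed)
    pulls = [0] * len(prices)
    mean_reward = [0.0] * len(prices)
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: try a random price.
            arm = rng.randrange(len(prices))
        else:
            # Exploit: use the price with the best average profit so far.
            arm = max(range(len(prices)), key=lambda i: mean_reward[i])
        reward = simulate_profit(prices[arm], cost, rng)
        pulls[arm] += 1
        # Incremental update of the running mean for this arm.
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]
    best_arm = max(range(len(prices)), key=lambda i: mean_reward[i])
    return prices[best_arm], pulls, mean_reward


candidate_prices = [15.0, 20.0, 25.0, 30.0, 35.0, 40.0]
chosen_price, pulls, mean_reward = epsilon_greedy_pricing(candidate_prices, cost=10.0)
print(chosen_price, pulls)
```

Note what the bandit returns: a price and an average reward per arm. It has no notion of stock, seasonality, or related products, and it produces no sales forecast that someone else in the business could act on.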

Many papers about using RL for pricing directly set revenue or profit as the RL algorithm’s reward function. Some were even validated in real e-commerce settings, like in this very interesting paper from Alibaba researchers, among other papers. They don’t try to predict sales. They try to maximize revenue/profit.

Where RL falls short

The problem is that in most real scenarios, it’s not good enough to have a black-box oracle that only provides a numeric output (the price you need to set for each product), and then sit and wait for the results, blindly trusting that you’re doing the best you can.

Here are some important things that a real business would need in order to actually take advantage of a pricing system:

  1. An estimate of future sales so that the stock replenishment team can keep up. Or at least, a way to tell our system that it can’t sell above a certain stock limit in a given period. This is the supply chain management mentioned before.
  2. A clear metric to check that the system is working reasonably well and it’s not just a sometimes-lucky random number generator.
  3. A way to diagnose and explain results, to find room for improvements or serious problems. Is there cannibalization between similar products? Out of stocks? Is the system underestimating sales for some reason? Any other systematic error in the predictions?

If you only had a magic box that told you the price that you should set, in order to get maximum revenue, you might choose to believe it. But you won’t know how well it’s actually doing, compared to the prices you didn’t test.

On the other hand, when your system also provides some kind of forecast (aside from the optimal price), you can easily test your predictions against reality.
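As a sketch of what testing predictions against reality can look like, the snippet below computes two simple aggregate metrics from forecast and actual unit sales: a weighted absolute percentage error (WAPE) and a signed bias. The choice of these two metrics and the sample numbers are assumptions made for illustration, not the metrics of any specific system.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: total absolute error / total actual sales."""
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("WAPE is undefined when total actual sales are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / total_actual


def bias(actual, forecast):
    """Signed relative error: negative means sales were underestimated overall."""
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("Bias is undefined when total actual sales are zero")
    return sum(f - a for a, f in zip(actual, forecast)) / total_actual


# Made-up weekly unit sales for one product.
actual_sales = [100, 80, 120]
forecast_sales = [90, 85, 100]
print(f"WAPE: {wape(actual_sales, forecast_sales):.3f}")
print(f"Bias: {bias(actual_sales, forecast_sales):+.3f}")
```

A consistently negative bias is exactly the kind of systematic underestimation that a price-only system gives you no way to detect.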

Note that you address all of the above issues when your pricing actions are derived from a sales forecasting model, as we’ll explain in each of the following sections.

Stock management: a big deal

For the case study in the paper mentioned above from Alibaba (and others like this one), the main focus was to optimize price for a fixed stock amount. The opposite alternative is also commonly found in research papers: infinite stock is assumed, so stock replenishment issues are not considered. In the paper, they mention that for their case (an online marketplace with multiple sellers for each product), predicting sales is not possible due to very unstable market conditions. They probably tried to forecast something at first and failed to obtain anything meaningful in their scenario. Once that has been confirmed, it is a good reason to explore a different approach.

It seems like the majority of the proposed RL models rely either on fixed stock or the assumption of infinite stock, which is the opposite extreme.

But in general, real cases aren’t in either extreme. Stores can usually adapt their stock policy to a certain extent for the sake of profitability. But finding a good stock replenishment policy requires some type of forecasting.

Not adapting the replenishment policy can cause either costly overstock or, worse, out-of-stocks. In the case of an online store with multiple sellers, this might not be a problem. However, in stores with a constant flow of goods, if you don’t have the product on the shelf, you won’t sell it. That is certainly inefficient because clients are not getting what they want. In our experience, when added up over many products, unexpected out-of-stocks are a very important issue for the performance of a pricing system.

Price optimization can lead to significant increases in profitability. The stock replenishment policy (whichever it is) must be considered part of the problem, sooner or later. The whole problem can be viewed as finding the price that produces the highest profit, according to the replenishment capacity or order arrival rate, as described in the book ‘Market Microstructure Theory’ by Maureen O’Hara.
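One simple way to picture how stock enters the problem is to cap the predicted sales at the available stock before computing the profit. The sketch below does that with a made-up exponential demand curve and a single-period stock limit. Both are assumptions for illustration, and this is much simpler than a full replenishment model.

```python
import math


def predicted_demand(price):
    """Hypothetical demand curve (an assumption standing in for a real forecast)."""
    return 1000.0 * math.exp(-0.05 * price)


def stock_limited_profit(price, cost, stock):
    """Profit when we can't sell more units than we have in stock."""
    units_sold = min(predicted_demand(price), stock)
    return units_sold * (price - cost)


def best_price_with_stock(cost, stock, prices):
    """Pick the candidate price with the highest stock-limited profit."""
    return max(prices, key=lambda p: stock_limited_profit(p, cost, stock))


prices = list(range(10, 101))  # candidate prices: 10, 11, ..., 100
ample = best_price_with_stock(cost=10, stock=1000, prices=prices)
scarce = best_price_with_stock(cost=10, stock=100, prices=prices)
print(f"Best price with ample stock: {ample}")
print(f"Best price with only 100 units: {scarce}")
```

With scarce stock, the best price moves up toward the point where predicted demand just matches the available units, which is why the stock forecast and the price decision can’t really be separated.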


The effort required to adapt a working system to different scenarios, or just for continuous improvement, is directly related to the so-called “explainability” of the model. How easy is it to map the different components of the model to understandable concepts or features that are observable in the real world?

How will you know if profit or sales were underestimated if they were not predicted?

A system that does no forecasting and only outputs the best price you should set is quite hard to understand and evaluate.

Forecasting accuracy

There are many different approaches to use RL for pricing, and some of them actually have an underlying sales prediction, maybe in an implicit way. But now, if sales are predicted somehow in an RL system (e.g., related to the Q function value), then this is a forecasting system after all. And for those cases, the question is then: if your system is in some way a forecasting model, is it the best one?

And that’s where RL doesn’t seem to be the option to go for, at least for now: so far, the best-performing forecasting systems are not based on any RL model but rather on combined methods. The usual approach is to start with something based on gradient boosted machines, such as XGBoost or LightGBM.

But that situation might change, and we need to stay tuned! That means, for example, that we need to keep an eye on the best performing models from competitions like M5 forecasting accuracy.

The reconciliation

We’re proposing the idea that Reinforcement Learning isn’t the best approach to forecasting yet, and that ideally, a pricing system is based on forecasting. For that part of the pricing system, we can move on and hang out with a boosted tree regressor that satisfies our forecasting needs. Such models are straightforward to implement and do a great job.

But don’t get me wrong: we are (and I am) very interested in Reinforcement Learning. Again, I’ll suggest that you check the pricing game we did just for fun to reinforce this point.

In industry, RL must be considered as an option whenever it fits, because it could potentially be a breakthrough. We did consider it as an option for pricing. It’s just that we preferred other alternatives for the very specific reasons explained before.

At this point, you might be wondering: in which cases do you think it could be a good option, then?

Pricing challenges where RL seems like a good fit

Following our reconciliatory mood, I’ll provide some examples that are part of a pricing system where Reinforcement Learning seems like a great option to consider.

Going back to the challenges mentioned before in this post, solving the pricing problem successfully also requires:

  • Ensuring that there’s enough stock available, coherent with our predictions.
  • Detecting similar products to be aware of/prevent potential cannibalization.
  • Handling price exploration/exploitation and ensuring that the data being gathered is helping us improve the results.

Could we use Reinforcement Learning to tackle these problems?

If you’re curious about it, we’re providing some ideas. 💡


This post’s underlying idea is to explain why we consider Reinforcement Learning as a second step in developing a pricing system and why we’ve so far preferred forecasting models adapted to the characteristics of sales predictions and used that to find the optimal prices.

We also present some imaginary use cases as an optional downloadable, to help the reader ground some of the most important RL concepts and build an intuition for how it works.

Stay tuned, and hopefully, we’ll have more posts about this fascinating topic and what we can do to help in your actual use case!
