Wed, Mar 27, 2019
There's some great theory around about pricing with machine learning (ML) and, in particular, the importance of the "right" data to build a successful ML model.
In practice, though, we've seen that there's a lot of confusion around the data types and formats that retailers can use to automate pricing or to implement price optimization systems using ML.
That's why we came up with 5 steps every company can take to prepare its data for price optimization. Follow them and you'll be ready to train your own ML model. Let's get started!
There's a wide range of useful information that retailers already have at their disposal when implementing a pricing solution with ML.
Kick off your data preparation project by considering all the inside information that you have that could be of value to your ML pricing solution. If you are unsure whether a variable could have an influence on a price or not, add it to the list. Further down the road, there will be a chance to prioritize and filter what data you’ll actually utilize.
Define example products. What are some typical products you sell?
Check the pricing of said example products. What do you currently charge for these products or, in the case of dynamic pricing, what is that range?
Identify internal data sources. What internal data do you have that you considered when you determined or changed this product's price? What internal data sources might you not be considering now but may eventually include down the road? Make a list. Insightful information can include product descriptions, product pictures, production costs, existing prices, or historical data.
Besides the internal data sources that you summarized in Step 1 above, there's external information that could affect customer demand and impact the determination of the best price. Examples include weather forecasts, exchange and inflation rates, local demand, and competitor pricing. These might be external data sources since retailers normally don't collect this type of information.
However, pricing systems that rely on machine learning have the sizeable advantage of being able to incorporate these factors. So, go wild and think of all the data you don't have internally that could have an impact on your customer's behavior and your pricing.
Define additional pricing factors. What are all the factors that you believe influence your buyers’ behavior and your pricing, but for which you'd need to obtain external data?
Find data sources. What are the possible sources for this information? Is the data publicly available? Are there third-party providers? List them out.
Check access to third-party data. If it’s possible to get the data from third-party providers, how is it accessed? Is there an API or would the provider agree to share the data in some other way with you? Could you scrape the data? If there's no way to access the data, you may want to start collecting the data yourself for future reference.
Tip: When choosing data providers, look for reputable organizations with relatively large infrastructures that are less likely to change over time and ideally offer web service APIs. This will save you future retooling when adapting to changes they make.
This is the time for a reality check regarding all that glorious, shiny, ideal data. It is a highly valued thing and usually not accessible without investing some considerable resources. There are financial and labor costs associated with the collection and preparation of data, and these require you to prioritize desirable data sources according to their capacity to determine the best pricing for your products.
Determine the impact of pricing factors. What's the impact of these factors from Steps 1 and 2 on your product pricing? Are your product sales strongly seasonal? Are related products frequently bought together? Try to come up with a low, moderate, or high impact ranking for each of them and you will begin to get an idea of what you can gain from all your information.
Determine the costs of collecting data. How hard will it be to add the factor to your current dataset given all of the external data? Can you query it from your database? Do you need to hire someone to collect that data for you? Again, it would be helpful to categorize the data using low, moderate, and high rankings based on how challenging it would be to incorporate it into your system.
Prioritize your data. What are the most valuable data sources according to their expected value and estimated costs, given your specific budget? Most of the time it's not necessary to have super duper completely perfect information to get good results, so if that's the case, don't feel discouraged! Start small and work your way up.
It's far easier to train ML models if you have structured data. A basic structure that is universally used is the tabular format, where rows represent different items sold, and columns the attributes of each item. Here's how to formulate your own product-price table:
Unstructured information of shirts sold:
Product SKU | Unit price | Date sold | Fabric | Size | Competitor Unit Price | Description |
---|---|---|---|---|---|---|
0001 | 50 | 19/01/08 | Silk | S | 40 | Winter 2018, long sleeved silk shirt. |
0002 | 40 | 18/08/30 | Cotton | XL | 35 | Summer 2016, short sleeved cotton shirt. Dinosaur print. |
0003 | 30 | 19/03/25 | Linen | M | 50 | 40 Spring 2015, short sleeved linen shirt. Floral pattern. |
We can help you evaluate your datasets.
Many retailers store data about their products in free-form text, such as the text in "description" in the example table “Unstructured information of shirts sold” above. Thing is, free-form text is not easily understood by machines and should be avoided as often as possible.
So, if you have free-form text, try to identify text patterns in your product descriptions. Is any of the information in your descriptions repeated for different products? In the example above, repeat information would be the year of the collection and information about short/long sleeves. These details can be extracted to separate, structured columns.
It's also very common to find categorical data among free text entries. You will likely have a list of brands that you deal with. Besides extracting the brand from the description and making it a separate field, you can provide a choice input and make data entry easier, as well as avoid typos and other mistakes.
Here’s what you can do if you have free-form text with repeat information:
Continuing with the example table from Step 3, the table below shows what structuring the original description field would look like. Notice here that collection is now an ordinal field while sleeve length and pattern/print are categorical ones.
Structured information of shirts sold:
Product SKU | Unit price | Date sold | Fabric | Size | Competitor Unit Price | Collection | Sleeve length | |
---|---|---|---|---|---|---|---|---|
0001 | 50 | 19/01/08 | Silk | S | 40 | Winter 2018 | Long | None |
0002 | 30 | 18/08/30 | Cotton | XL | 35 | Summer 2016 | Long | Dinosaur |
0003 | 40 | 19/03/25 | Linen | M | 50 | Spring 2015 | Short | Floral |
Data is at the heart of every successful ML model. This is why we cannot emphasize enough the importance of storing your data. Here are some typical scenarios you should keep in mind when doing so:
Sometimes, there's information we use for decision making at a point in time and then forget, thinking we won't use it in the future. Well, with ML, just like with any learning, this works a bit differently. Those reasons you based your decision on can be useful input for an ML algorithm. Maybe when pricing reveals a certain trend, a correction is in order; perhaps one method of measurement provides more accurate results; maybe the reason this data is off is because the last quality check was performed a long time ago.
All of these inputs can be of use in understanding and modeling a process more precisely. Thus, whenever possible try to keep track not only of what you did but also why and how you did it.
The amount of data needed to train an ML model depends on individual business needs and unique goals. Generally-speaking, the more data you have, the better. Data collected over the last 3-12 months is a good start, whereas data from more than a year will allow you to account for seasonal pricing factors.
Once you've gathered enough data, it can be used by engineers to train the ML model to start automating and optimizing your pricing process. If you'd like to learn the next steps of the price optimization process, you can check out this detailed article.
We hope this post helps you transform your data into usable information and brings you one step closer to implementing a successful pricing solution using ML.
© 2025. All rights reserved.