In data science, as in the rest of life, there are multiple ways to get something done. And the route you choose always comes with trade-offs. This is absolutely true when it comes to media mix modeling, where the choice of model locks you into the strengths and weaknesses of that model.
Media mix modeling is a great tool for marketers looking to allocate their budget more intelligently. By studying how you’ve allocated your budget in the past and the results those choices have produced, a media mix model can provide you with valuable insights, such as:
- understanding how important a role each marketing channel has played in the past
- discovering the optimal way to allocate future budgets
- predicting the result of a given allocation
The problem? It’s hard to pursue more than one of those insights at a time. For example, if optimizing future budgets is your key focus, you’ll need to choose a media mix modeling technique that prioritizes optimization. But that model won’t be as effective at explaining channel contributions or predicting campaign outcomes. And a model that’s great for prediction won’t be as strong at explaining channel value or optimizing budgets.
In this post, we’ll walk through the trade-offs of three media mix modeling approaches: Bayesian linear regression, gradient boosted trees, and deep learning.
Bayesian Linear Regression
Bayesian linear regression is an extension of linear regression that conducts its business in the realm of Bayesian statistics. It allows us to gain a much deeper understanding of the parameters in our model.
For example, while ordinary linear regression produces a single point estimate for each coefficient, Bayesian linear regression produces a full posterior distribution for each coefficient.
This is useful because the coefficient for each feature tells us how important that feature is to the model. So, in the case of a media mix model, these coefficients tell us how important each of our marketing channels is relative to the other channels and to other factors like brand awareness and seasonal effects. When the coefficients become distributions instead of single values, we gain additional information about the uncertainty in those coefficients.
With a single-valued coefficient, all you can say is “Paid social has the largest coefficient, so it must be the most important channel in driving sales.” With the full distribution, you can see that while paid social may have the largest coefficient, it may also have an extremely broad distribution, showing the model isn’t sure what to do with this channel.
This gives you the information to ask: do you want to pump resources into the channel that appears to have the largest impact but carries huge uncertainty, or shift them toward the channel that has the second-largest impact and is very well understood by the model? Bayesian linear regression provides the additional information you need to feel confident in your analysis of historical data.
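To make this concrete, here’s a minimal sketch of fitting a Bayesian linear regression with the PyMC library. Everything here is an illustrative assumption: the three channels, the synthetic weekly data, and the choice of priors.

```python
# A minimal sketch, assuming PyMC and ArviZ are installed. The channels
# and weekly spend data are made up for illustration.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
n_weeks = 104
# Columns: paid social, display, search (hypothetical weekly spend)
X = rng.uniform(0, 100, size=(n_weeks, 3))
sales = 50 + X @ np.array([0.8, 0.3, 0.5]) + rng.normal(0, 10, n_weeks)

with pm.Model() as mmm_model:
    intercept = pm.Normal("intercept", mu=0, sigma=50)
    # Half-normal priors assume channel effects are non-negative,
    # a common (but optional) choice in media mix models.
    beta = pm.HalfNormal("beta", sigma=1.0, shape=3)
    noise = pm.HalfNormal("noise", sigma=20)
    pm.Normal("sales", mu=intercept + pm.math.dot(X, beta),
              sigma=noise, observed=sales)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# The posterior gives each coefficient's center *and* spread, so you can
# see which channel effects the model is actually confident about.
print(az.summary(idata, var_names=["beta"]))
```

A wide posterior on a channel’s coefficient is exactly the “broad distribution” warning described above.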
I said I was going to discuss trade-offs for each of these approaches, so now it’s time for the bad news, and this one should be pretty obvious. Despite all the advantages of introducing Bayesian statistics to our linear regression model, it’s still a linear regression model. For all their simplicity and ease of interpretation, linear regression models will struggle if your data contains nonlinear trends, such as the diminishing returns that are common in marketing response data.
Gradient Boosted Trees
At this point, you may be thinking: I don’t care how these channels have performed in the past. I want to extract every bit of information out of my data so I can make the best choices going forward.
This is where gradient boosted trees (GBTs) shine. GBTs are extremely popular models that, on tabular data like this, frequently outperform top-of-the-line deep learning models with very little configuration required. This is the type of model that powers Alight’s media mix model.
In essence, these models are fancy decision trees, like a game of 20 questions where you home in on the correct answer because each question helps weed out incorrect answers. GBTs are essentially sequences of decision trees in which each new tree is trained to correct the errors of the trees before it, and the final prediction combines all of their outputs.
Gradient boosted trees are very effective at learning more complicated patterns from the data and using those patterns to predict an outcome. That makes them perfect for learning from past marketing spend to predict the results of new spend combinations.
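For illustration, here’s a minimal sketch using scikit-learn’s GradientBoostingRegressor. The channels, spend ranges, and response curve are all made-up assumptions, not real campaign data.

```python
# A minimal sketch, assuming scikit-learn is installed. The data is
# synthetic: three hypothetical channels with a nonlinear sales response.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_weeks = 300
# Columns: paid social, display, search (hypothetical weekly spend)
X = rng.uniform(0, 100, size=(n_weeks, 3))
# Diminishing returns on paid social make the response nonlinear.
sales = (50 + 8 * np.sqrt(X[:, 0]) + 0.3 * X[:, 1] + 0.5 * X[:, 2]
         + rng.normal(0, 5, n_weeks))

X_train, X_test, y_train, y_test = train_test_split(X, sales, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                  max_depth=3, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out weeks:", model.score(X_test, y_test))
```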
The architecture of these trees makes it possible to produce predictions quickly, though training times are longer. This is the ideal trade-off for a model designed to optimize budget allocation across marketing channels and produce the highest return on investment. If you’re responsible for optimizing marketing spend, the model only needs to be trained once, but it will need to produce numerous predictions during the optimization process. Here, faster predictions mean faster optimization.
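As a toy example of why prediction speed matters, a naive optimizer might sweep many candidate allocations through the trained model and keep the best one. This sketch reuses the hypothetical model from above with a made-up total budget.

```python
# Continuing the sketch above: optimization calls predict() many times,
# so fast inference directly speeds up the search.
import numpy as np

budget = 150.0  # hypothetical total weekly budget
splits = np.linspace(0, budget, 101)
# Split the budget between paid social and display, holding search at zero.
candidates = np.column_stack([splits, budget - splits, np.zeros_like(splits)])
predicted_sales = model.predict(candidates)
best = candidates[np.argmax(predicted_sales)]
print("Best (paid social, display, search) allocation:", best)
```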
GBTs also maintain some level of interpretability. The gradient boosting algorithm does provide additional complexity when interpreting the value of each channel, but not enough complexity to make it a meaningless pursuit. While you won’t be able to achieve the high level of detail the Bayesian linear regression model provides, GBTs still provide a more loosely defined idea of the importance of each feature.
Again, it’s time for the bad news. The first negative point is what we just discussed: while you can get an idea of feature importance, it’s much more difficult to interpret than what Bayesian linear regression gives you.
For example, GBTs naturally provide a measure of feature importance. Going back to the 20 questions analogy, the importance of a question can be measured by how much that question narrows down the field of possible answers; this is one way feature importance in GBTs can be calculated. Unfortunately, while the GBT will tell you that a feature is important, it won’t tell you in what way the feature influences the prediction. The most important feature could just as easily be driving sales down as driving them up. Clearly, that’s a real limitation if understanding channel value is your goal.
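Continuing the hypothetical sketch from above, scikit-learn exposes exactly this kind of impurity-based importance. Note that the numbers are unsigned, which is the limitation just described.

```python
# Continuing the sketch above. Importances sum to 1 and carry no sign,
# so an "important" channel could be pushing sales up or down.
for name, importance in zip(["paid social", "display", "search"],
                            model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```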
The other negative aspect of GBTs is potentially even more important. By their nature, GBTs are unable to extrapolate. (This shouldn’t be a huge surprise to any data scientists because extrapolation is one of the big questions in machine learning.) However, they are extremely proficient in modeling data within the boundaries of the training dataset.
But given a feature value higher or lower than anything in the training data, the model will struggle. You can imagine that the GBT has a maximum value it will ever predict: once you exceed the largest value each feature took in the training data, the model will keep predicting the exact same value no matter how much further you increase the feature values.
Let’s say you have two marketing channels: paid social and display. In your historical data, the most you’ve ever spent on paid social is $100 and the most you’ve ever spent on display is $200. If the model predicts you will see 100 sales with paid social spend of $100 and display spend of $200, then the model will also predict 100 sales if you spend $1,000,000 on each.
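You can see the same flat-lining in the hypothetical model from earlier, where training spend never exceeded roughly $100 per channel.

```python
# Continuing the sketch above: once every feature exceeds its training
# maximum, the inputs land in the same leaves and the prediction freezes.
import numpy as np

at_training_max = model.predict(np.array([[100.0, 100.0, 100.0]]))
far_beyond = model.predict(np.array([[1e6, 1e6, 1e6]]))
print(at_training_max, far_beyond)  # identical predictions
```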
So these are the trade-offs with gradient boosted trees:
- Feature importance is available, but not straightforward to understand.
- Fast predictions enable quick optimization, but training times are slower.
- Excellent interpolation, but no extrapolation.
Deep Learning
GBTs may be a step in the right direction for you, but you might be concerned about the fact that they can’t extrapolate. Maybe you’re really looking to ramp up marketing and exceed the bounds of your historical data.
This is where you may have success with a deep learning model. I emphasize the word “may” because deep learning models tend to require large datasets to train on while the size of datasets for media mix modeling is typically quite small.
Deep learning models provide extremely limited extrapolation capabilities, but genuine extrapolation remains an unsolved problem across machine learning. (On the bright side, we’re just one breakthrough away from deep learning models that are capable of real extrapolation!)
The benefit of a deep learning model is that it will continue to respond to increases in feature values where a GBT simply will not. The drawback is that it’s even harder to interpret a deep learning model. Deep learning models are currently black boxes: you put in some inputs, get out some outputs, and don’t really understand why you get what you get.
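As a rough illustration, here’s a small neural network using scikit-learn’s MLPRegressor, reusing the hypothetical data from the GBT sketch. Unlike the trees, its predictions keep moving beyond the training range, though nothing guarantees they move in a realistic way.

```python
# A minimal sketch, reusing X_train / y_train from the GBT example above.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
net.fit(X_train, y_train)
# Predictions continue to change past the $100 training maximum,
# but nothing forces those changes to be realistic.
print(net.predict(np.array([[100.0, 100.0, 100.0]])))
print(net.predict(np.array([[200.0, 200.0, 200.0]])))
```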
Interpretability vs. Performance
When selecting a media mix modeling technique, you’ll usually be choosing between interpretability and performance — a trade-off that you encounter in data science broadly.
Do you want to know why your model is making predictions and understand the importance of the features in your training data?
Or do you care more about building a model that understands deep patterns in the data, but struggles to expose those patterns to you in an understandable format?
The answer to these questions should guide your decision-making when you begin to implement a media mix model.
Ready-to-Use Media Mix Models
ChannelMix features built-in media mix models that even nontechnical marketers can use to plan upcoming campaigns. Schedule a call with our team to learn more!