Vova's Blog Homepage

Hi, I’m Vladimir Eremin, I am a data scientist during the day and sometimes at night as well. Welcome to my blog, where I share stories hidden behind data. Here, I explore the realms of data science, quantitative finance, and problem-solving. This blog has started as my offline journal and and some point I decided to start publishing in the hope it may be helpful for others as well. Please look around if something catches your eye. Feel free to reach out to me if you have questions or suggestions. Additionally, for those of you in NYC, I co-host a data science meetup - if you’re around, I’ll be happy to meet you at our next event!

View My GitHub Profile

5 October 2023

Multi-Arm Bandits for Recommendation Systems

by Vladimir


[mathjax]

Envision yourself as a store owner looking to boost foot traffic. You hatch a plan: stand at the front and actively promote your items to those walking by. With enthusiasm, you quickly wear a striking outfit and hold up a blank board, ready to publicize a special deal. However, when it's time to pen the actual offer, you're stumped. A "10% discount on everything" could work, but you're torn between it being too small to matter or too large to remain profitable. A flat "$2 off your initial buy" crosses your mind, but could this be overly generous for just any passerby? Then the idea of "Free gift with every purchase" springs forth. It's intriguing, but will it genuinely enhance the day's revenue?

You quickly realize that what you truly need is some sort of method, capable of predicting the most appealing offer for each potential customer.
Welcome to the intricate and intriguing realm of recommendation systems, where algorithms help to match the right offer to the right customer, improving your chances of conversion and sales.

Recommendation systems is all about data

When we talk about recommendation systems, we talk big data! Abundance of data ensures a higher level of confidence about user preferences. In contrast, when there is limited data, there is a lack of certainty about what users might prefer. However, even in the face of such uncertainty, recommenders tend to prioritize and suggest items that have received a higher level of engagement previously.

For instance, with collaborative filtering, a most popular method typically used by recommender systems, users are matched based on their past behaviors or ratings. A user may be recommended a movie that other similar users have enjoyed.

Meanwhile, item-based similarity approaches would recommend items that are similar to ones the user has previously interacted with or rated highly. For instance, if a user liked a certain science fiction book, they would be recommended other books in the same genre or by the same author.

However, these methods can contribute to a feedback loop, where items that aren't initially recommended continue to receive low to no engagement. This results in a cycle where those items that were initially popular and recommended continue to get promoted, while potentially relevant items get ignored. This might lead to the problem of overfitting, where only a limited set of items get recommended over and over, while a large variety of other potentially interesting items are not discovered or recommended. As a result, the system fails to explore the breadth of user preferences, limiting the overall user experience.

Mastering Multi-Arm Bandits

There are several approaches that can solve this type of overfitting, the Multi-Arm Bandits is one of the most powerful and commonly used ones. Bandit algorithms propose a solution to this feedback-loop problem by modeling uncertainty and intentionally exploring to minimize it. This strategy embodies the concept of 'exploration vs exploitation', where 'exploitation' is using the knowledge we already have to recommend the most engaging items, while 'exploration' involves suggesting less-known items to gather more information about user preferences.

By acknowledging the inherent uncertainty in the data, bandits actively engage in exploration to reduce it, thereby learning about the potential relevance of previously unexplored or underexplored items. For instance, instead of continually recommending the best-performing item (a best-selling menu item, for example), a bandit algorithm might occasionally recommend a less popular item (perhaps a new item or previously unpopular). This explorative approach provides an opportunity to discover whether users might like this less-known item, thereby expanding the scope of recommendations.

Moreover, bandits incorporate a system of rewards, where a successful recommendation (like a user clicking on a recommended item) garners a 'reward'. Over time, these rewards inform the algorithm about the likelihood of different items being appreciated by different users, further refining the recommendation process. This harmonious blend of uncertainty, exploration, and learning makes bandit algorithms a compelling approach to mitigating the issues prevalent in the realm of recommendation systems.

Reward is usually denoted as $latex R_t $

Every time an action is taken a reward is obtained. Each time an action is taken the reward returned can have a different value.

Below is few more definition that are used to define the Bandit model:

The most simple example of the reward function is a Sample Average Estimates:

$$Q_{t+1}(a) = \frac{1}{n} \sum_{i=1}^{n} R_i$$

Choosing The Right Bandit

Choosing different design of the reward function yield different types of Bandits.

In the following sections, we will touch upon three fundamental bandit algorithms: epsilon-greedy, Upper Confidence Bound (UCB), and Thomson Sampling.

  1. Epsilon-Greedy: This algorithm operates by using a probability $latex ε$ to determine if it will explore or exploit. So, the reward function could be represented like:

$$R_t = \begin{cases} \text{argmax}_{a} Q_t(a), & \text{with probability } 1 - ε \\ \text{a random action}, & \text{with probability } ε \end{cases}$$

Where

  1. Upper Confidence Bound (UCB): The UCB algorithm adds a confidence interval to the estimated reward for each action. The action chosen is the one with the highest upper confidence bound. This could be represented like:

$$A_t = \text{argmax}_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

Where $latex A_t$ is the action chosen at time $latex t$, $latex Q_t(a)$ is the estimated value of action $latex a$ at time $latex t$, $latex c$ is a constant determining the level of exploration, $latex N_t(a)$ is the number of times action $latex a$ has been chosen up to time $latex t$, and $latex l_n$ is the natural logarithm.

  1. Thomson Sampling.

In this algorithm, an action is chosen based on the highest sample from each action's reward distribution. This could be represented like:

$$A_t = \text{argmax}_{a} \left[ \text{Sample from } β(R_{1:t-1}(a), T_{1:t-1}(a)) \right]$$

Where $latex A_t$ is the action chosen at time $latex t$, $latex β$ represents a Beta distribution (commonly used in Thompson Sampling when rewards are binary), $latex R_{1:t-1}(a)$ is the sum of rewards for action $latex a$ up to time $latex t-1$, and $latex T_{1:t-1}(a)$ is the total number of times action $latex a$ has been chosen up to time $latex t-1$.

Our primary focus will be on Thomson Sampling, as it frequently surpasses both epsilon-greedy and UCB algorithms in performance. For readers interested in delving deeper into each algorithm and exploring a comprehensive comparison between them, we've included pertinent links in the reference section.

Data For Bandits

Returning to our shopkeeper scenario, after acquiring a solid understanding of multi-arm bandits and different reward functions, our shopkeeper resolves to employ the Thompson Sampling technique to identify the optimal offer for each of his patrons. His store, being somewhat removed from the bustling main street, sees a regular flow of three individuals passing by daily - we'll denote these as Person $latex P_1$, $latex P_2$, and $latex P_3$.

For each individual $latex P_i$ (where $latex i$ ranges from 1 to 3), there are five potential offers (or actions) that the shopkeeper could extend, denoted as $latex a_1, a_2, \ldots, a_5$. Notably, he also posits that not all individuals may necessarily need an offer to visit the store. To verify this hypothesis, he decides to occasionally abstain from extending any offers at all, thereby establishing a control observation for his experiment.

With each passing day, the shopkeeper observes the interactions of each individual with the varying offers (including the no-offer scenario), aiming to discern the optimal action $latex A_i$ for each person $latex i$. This optimal action would not only encourage a higher frequency of store visits (improving conversion rates), but also exhibit an uplift over the control scenario of no offers. In doing so, the shopkeeper can ensure his marketing campaign is cost-efficient, only extending offers when they demonstrably enhance store visitation.

Defining The Learner

With a clear strategy in mind, the shopkeeper set forth on a two-month experiment. Each day, he displayed a distinct offer (or no offer at all) to each of the three individuals, and meticulously recorded the outcome or "reward" ($latex R$) as either a conversion (1) or non-conversion (0):

$$R_{t,i,a} =
\begin{cases}
1 & \text{if conversion} \\
0 & \text{if non-conversion}
\end{cases}$$

In this formula, $latex t$ denotes the time step or day index, $latex i$ signifies the person, and $latex a$ indicates the offer index.

The shopkeeper then updated the parameters of the Beta distribution for each person's offer preference using the following rules:

Here, $latex \alpha_{t,i,a}$ and $latex \beta_{t,i,a}$ denote the parameters of the Beta distribution for person $latex i$, action $latex a$ at time step $latex t$.

These updates provided the shopkeeper an estimated conversion probability $latex P_{i}(X|a)$ for each individual and offer. This probability followed a Beta distribution:

$$P_{i}(X|a) \sim \text{Beta}(\alpha_{t,i,a},\beta_{t,i,a})$$

Finally, to decide which offer to show next, the shopkeeper chose the action $a$ that yielded the maximum expected reward. He used the following strategy to select the action:

$$A_{t+1,i} = \text{argmax}_{a} \left[ \text{Sample from } \text{Beta}(\alpha_{t,i,a}, \beta_{t,i,a}) \right]$$

Running the Model

After two months of diligent experimentation, the shopkeeper had amassed a rich collection of data for each passerby, shedding light on their distinct preferences. The shopkeeper meticulously recorded a sequence of observations:

       
customer_id day_index offer_index success
0 0 0 0
1 0 0 0
2 0 0 1
0 1 1 0
1 1 1 0
2 1 1 1

This table delineates the data he recorded. The day_index column begins from 0, representing the first day of the experiment. The customer_id could be either 0, 1, or 2, indicating the individual to whom an offer was presented. The offer_index reflects the specific offer shown - ranging from 1 to 5, with '-1' indicating a control experiment where no offer was presented. Finally, the success column is a binary marker showing whether the individual visited the shop after seeing the offer, with '1' indicating a visit and '0' representing no visit.

Armed with this data, the shopkeeper dusted off his seldom-used laptop and entered these records into a Jupyter Notebook to analyze and visualize the results. His analysis culminated in a series of animated plots demonstrating the convergence of the Beta distribution for each individual over time.

Conclusion

In conclusion, the shopkeeper's innovative application of Thompson Sampling provided him with invaluable insights. It empowered him to tailor his marketing strategy for each individual customer, enhancing efficiency and cost-effectiveness.

By mapping customer responses to offers, he could discern the impact of different marketing efforts. This data-driven approach enabled more effective customer segmentation. Instead of a one-size-fits-all strategy, he could now tailor offers based on individual preferences, optimizing the value of his marketing budget.
Moreover, he could identify customers who remained unaffected by offers, thereby avoiding unnecessary marketing expenditure. On the flip side, he recognized customers who would purchase irrespective of promotions, saving resources on unneeded incentives.

In essence, the shopkeeper's strategic use of Thompson Sampling refined his marketing from broad-strokes to precision-guided. This case underscores the potential of machine learning algorithms in enhancing marketing effectiveness and customer segmentation while ensuring optimal return on marketing investment.

References

Originally published on Grid Dynamics Blog


tags: