Predicting Viewer Demographics Part I: Consumption History
Big data is an obvious trend in nearly every industry today, and TV is no exception. The question, of course, is how to turn that data into knowledge. The FANG companies (Facebook, Amazon, Netflix, Google) lead the way in collecting, analyzing and monetizing data, and the information they gather is both detailed and broad. Such data has the potential to benefit video service providers (VSPs) in a multitude of ways. For example, VSPs that can identify users with low movie consumption can target them specifically with incentives that encourage more viewing. Data that includes demographics can be used to identify movies that are popular with both genders and across age groups, which would help VSPs identify selections that suit a wider audience.
At Verimatrix, we have started monitoring applications in the TV domain ourselves, and we have ventured to answer a basic question: What assumptions can we make about viewers based solely on their movie consumption data? In Part I of this three-part series, we take a look at the types of generalizations we can make by exploring the connection between demographics and movie consumption.
We started our research by classifying users in terms of gender and age. While an individual subscriber’s viewing history may seem like an obvious starting point, on its own it is a fairly unspecific piece of information when no demographics are attached. From there, we were able to categorize users based on how much they watched.
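As a rough illustration of categorizing users by the amount of their consumption, the sketch below counts viewing events per user and assigns each user to a tier. The function name, the tier labels, the thresholds and the sample users are all hypothetical and chosen only for the example; they are not taken from our analysis.

```python
from collections import Counter

def consumption_tiers(events, low_max=2, high_min=5):
    """Group users into consumption tiers.

    events: iterable of (user_id, movie_title) pairs, one per viewing.
    low_max and high_min are hypothetical thresholds for illustration.
    """
    per_user = Counter(user for user, _ in events)
    tiers = {}
    for user, n in per_user.items():
        if n <= low_max:
            tiers[user] = "low"
        elif n >= high_min:
            tiers[user] = "high"
        else:
            tiers[user] = "medium"
    return tiers

# Made-up viewing log: ann watched 2 movies, bob 4, cy 6.
events = (
    [("ann", f"Movie {i}") for i in range(2)]
    + [("bob", f"Movie {i}") for i in range(4)]
    + [("cy", f"Movie {i}") for i in range(6)]
)
tiers = consumption_tiers(events)
print(tiers)
```

Users in the "low" tier would be the natural candidates for the kind of viewing incentive mentioned earlier.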
Through our research, we were able to develop a formula that uses the available data to predict the age and/or gender of a subscriber, and we tested its accuracy on a set of previously unseen data. The dataset was aggregated from 6,000 viewers who had watched a combined total of 4,000 different movies across 18 genres. Each user had rated at least 20 movies on a scale of 1 to 5 and had provided demographic information such as gender, age, zip code and occupation.
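To make the idea of predicting a demographic from viewing history and checking it on unseen data more concrete, here is a minimal sketch. It uses a simple naive Bayes classifier with add-one smoothing as a stand-in; it is not the formula described above, and the movie titles, labels and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(histories):
    """histories: list of (set_of_movie_titles, label) pairs."""
    label_counts = Counter(label for _, label in histories)
    movie_counts = defaultdict(Counter)  # label -> movie -> viewer count
    vocab = set()
    for movies, label in histories:
        movie_counts[label].update(movies)
        vocab |= movies
    return {"labels": label_counts, "movies": movie_counts, "vocab": vocab}

def predict(model, movies):
    """Return the label with the highest log-probability for this history."""
    total = sum(model["labels"].values())
    best_label, best_score = None, -math.inf
    for label, n in model["labels"].items():
        score = math.log(n / total)  # prior
        denom = sum(model["movies"][label].values()) + len(model["vocab"])
        for m in movies:
            if m in model["vocab"]:  # ignore movies never seen in training
                score += math.log((model["movies"][label][m] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, held_out):
    """Share of held-out users whose label is predicted correctly."""
    hits = sum(predict(model, movies) == label for movies, label in held_out)
    return hits / len(held_out)

# Hypothetical toy data; real histories would hold 20+ movies per user.
train_set = [
    ({"Action A", "Action B"}, "M"),
    ({"Action A", "SciFi C"}, "M"),
    ({"Romance D", "Drama E"}, "F"),
    ({"Romance D", "Action A"}, "F"),
]
test_set = [
    ({"Action B", "SciFi C"}, "M"),
    ({"Drama E", "Romance D"}, "F"),
]
model = train(train_set)
print(accuracy(model, test_set))
```

The important part is the split: the model only ever sees train_set, so the accuracy on test_set reflects how well it generalizes to users it has not seen before.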
The image above displays the posters for some of the movies that we found to discriminate most strongly between females (left) and males (right), while the word cloud below prominently identifies each gender's most popular genres. There is a clear genre preference, with males watching a lot of action movies and females more frequently watching movies that deal with friendship and romance. While these findings aren’t groundbreaking, they are consistent with what one would assume, and that is helpful in verifying the data.
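One simple way to rank movies by how strongly they separate the two groups is to compare the share of female viewers who watched a title with the share of male viewers who watched it. The sketch below does exactly that; the scoring rule is an assumption made for illustration rather than the measure we used, and the titles are made up.

```python
from collections import Counter

def discriminating_movies(histories, top_n=1):
    """Rank titles by (share of F viewers) minus (share of M viewers).

    histories: list of (set_of_movie_titles, gender) with gender "F" or "M".
    Assumes at least one viewer of each gender is present.
    Returns (female_leaning, male_leaning), each a list of top_n titles.
    """
    viewers = Counter(gender for _, gender in histories)
    watched = {"F": Counter(), "M": Counter()}
    for movies, gender in histories:
        watched[gender].update(movies)
    titles = set(watched["F"]) | set(watched["M"])
    gap = {
        t: watched["F"][t] / viewers["F"] - watched["M"][t] / viewers["M"]
        for t in titles
    }
    # Sort by gap, breaking ties by title so the result is deterministic.
    ranked = sorted(gap, key=lambda t: (gap[t], t))
    return ranked[-top_n:][::-1], ranked[:top_n]

# Hypothetical viewing histories.
histories = [
    ({"Romance D", "Drama E"}, "F"),
    ({"Romance D", "Action A"}, "F"),
    ({"Action A", "Action B"}, "M"),
    ({"Action B"}, "M"),
]
female_leaning, male_leaning = discriminating_movies(histories)
print(female_leaning, male_leaning)
```

Using shares rather than raw counts matters when one gender contributes far more viewing data than the other, since raw counts would otherwise favor the larger group.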
Next, we analyzed the posters of movies watched by different age groups. The image below shows the most-watched movies for every age group, and it is quite evident that viewers under 18 watch a lot of movies that star children and feature animals. People in the 18-35 age group mostly watch young, heroic, happy movies, while people over 50 tend to watch older movies.
From the dataset, we can also see when movies were rated, and we found that some months received a significantly greater number of ratings, as shown below. We can speculate about the reason, such as vacation times, but this is a good example of anomalies in the data that can be hard to explain. There isn’t always conclusive evidence to explain these spikes.
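A monthly breakdown like this can be produced by bucketing rating timestamps by calendar month and flagging months that stand well above the typical level. The sketch below assumes each rating carries a Unix timestamp in seconds and flags months with at least twice the median count; both the timestamp format and the threshold are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import median

def ratings_per_month(timestamps):
    """Count ratings per (year, month), reading timestamps as UTC seconds."""
    counts = Counter()
    for ts in timestamps:
        d = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(d.year, d.month)] += 1
    return counts

def spike_months(counts, factor=2.0):
    """Return months whose count is at least factor times the median count."""
    typical = median(counts.values())
    return sorted(m for m, n in counts.items() if n >= factor * typical)

# Made-up sample: one rating in January, one in February, four in March.
sample = [
    datetime(2000, month, 15, tzinfo=timezone.utc).timestamp()
    for month in (1, 2, 3, 3, 3, 3)
]
monthly = ratings_per_month(sample)
spikes = spike_months(monthly)
print(dict(monthly), spikes)
```

A check like this only locates the spikes; it says nothing about why they occur.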
Next, we were able to identify the keywords most frequently associated with the movies each gender watched. As we can observe in the word cloud below, men watched movies with keywords like “military,” “human-alien,” “gun,” etc., which point to action and sci-fi movies. Conversely, some of the keywords associated with the movies that women watched were “1810s,” “bouquet,” “Cinderella,” etc., which point to fantasy and classic movies.
However, this dataset has some limitations. The demographics and ratings were self-reported, and many of the movies are more than 20 years old. The date a movie was rated is assumed to be the date it was watched. The amount of consumption data is also not balanced between genders. Another key consideration is that we can’t be certain who within the household rated which movie. It is important to keep in mind that any dataset has its own set of limitations.
Up Next: In Part II of the series, we share our analyses made using movies’ meta information and reveal what can be learned from this data. We’ll also look into which information has the most predictive power for understanding users.