Jan 12, 2018

In the last blog post, we showed how keywords can be used to better understand the consumer. Here we explore whether genres provide stronger results. In our study, we analyzed the distribution of genres using a dataset of 4,000 movies.


The distribution of genres in the figure above shows that movies fall under the drama, comedy and action categories more often than any other genre. We used a supervised learning algorithm, as described in Part II of this blog series.

We used only movie genres as features to create the prediction model. With just the movie genres, we were able to predict gender with an accuracy of 76%. Also, for a selected 60% of these viewers, we can predict their gender with 80% confidence.
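To make the idea of "genres as features" concrete, here is a minimal sketch of one way a viewer's watch history could be turned into a genre profile. The genre list, the movie titles and the `genre_features` helper are all made up for illustration; the post does not specify the exact encoding we used.

```python
from collections import Counter

# Hypothetical genre vocabulary and movie-to-genre lookup (illustrative only).
GENRES = ["action", "comedy", "drama", "romance"]
MOVIE_GENRES = {
    "movie_a": ["action", "drama"],
    "movie_b": ["comedy"],
    "movie_c": ["drama", "romance"],
}

def genre_features(watched, movie_genres=MOVIE_GENRES, genres=GENRES):
    """Return, per genre, the share of the viewer's watched movies in that genre."""
    counts = Counter(g for title in watched for g in movie_genres.get(title, []))
    n = len(watched)
    if n == 0:
        return [0.0] * len(genres)
    return [counts[g] / n for g in genres]

# A viewer who watched two movies, both tagged as drama.
features = genre_features(["movie_a", "movie_c"])
print(features)  # [0.5, 0.0, 1.0, 0.5]
```

A vector like this, one per viewer, is the kind of input a supervised classifier can be trained on, with gender as the label.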

Since the dataset also included the rating each user gave to every movie they had watched, we tested using the ratings for each movie title in an attempt to increase the accuracy of the model. With this addition, we were able to increase the accuracy of the model to 78% for gender prediction and 80% for age classification. Also, for 90% of these cases, we can tell the gender with 80% confidence. For 30% of these cases, we can tell the gender with 90% confidence.
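Statements such as "for 30% of these cases, we can tell the gender with 90% confidence" can be read as coverage at a confidence threshold: the share of viewers for whom the classifier's predicted probability clears the threshold. The sketch below assumes that reading and a binary classifier that outputs a probability for class 1. The `coverage_at_confidence` helper and the sample numbers are made up and are not results from our data.

```python
def coverage_at_confidence(probs, labels, threshold):
    """Return (coverage, accuracy among covered cases).

    probs  -- predicted probability of class 1 for each viewer
    labels -- true class (0 or 1) for each viewer
    """
    covered = []
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)  # confidence in whichever class is predicted
        if confidence >= threshold:
            predicted = 1 if p >= 0.5 else 0
            covered.append(predicted == y)
    coverage = len(covered) / len(probs) if probs else 0.0
    accuracy = sum(covered) / len(covered) if covered else 0.0
    return coverage, accuracy

# Made-up probabilities and labels for five viewers.
probs = [0.95, 0.85, 0.60, 0.10, 0.45]
labels = [1, 0, 1, 0, 1]
coverage, accuracy = coverage_at_confidence(probs, labels, 0.8)
```

Raising the threshold usually shrinks the covered share while making the predictions inside it more reliable, which is the trade-off the 90%/30% figures describe.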

However, one main problem with this model is that whenever new movies are added to the dataset, we have to re-train it to incorporate the added titles. The previous model, which uses only genres and subgenres, remains applicable even when new movies are added. This allows us to better understand users who watch movies that have not been rated, and to make predictions about movies before they are available.
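The difference can be sketched with two toy encodings. With one column per title, a movie released after training has no column, so it contributes nothing. With one column per genre, the same movie maps onto columns the model already knows. All names below are hypothetical and the encodings are simplified to 0/1 flags.

```python
# Hypothetical columns fixed at training time (illustrative only).
TRAINED_TITLES = ["movie_a", "movie_b"]
GENRES = ["action", "comedy", "drama"]

# Hypothetical genre lookup; a new release can be added here without retraining.
MOVIE_GENRES = {
    "movie_a": ["action"],
    "movie_b": ["drama"],
    "movie_new": ["comedy"],
}

def title_vector(watched):
    """One column per title seen at training time; unseen titles have no column."""
    return [1 if t in watched else 0 for t in TRAINED_TITLES]

def genre_vector(watched):
    """One column per genre; any title with known genres contributes."""
    seen = {g for t in watched for g in MOVIE_GENRES.get(t, [])}
    return [1 if g in seen else 0 for g in GENRES]

new_viewer = ["movie_new"]
title_feats = title_vector(new_viewer)  # all zeros: the new title carries no signal
genre_feats = genre_vector(new_viewer)  # the comedy column is set
```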

What surprised us was that movie keywords told us little about the viewer, whereas genres gave much greater insight, with an increase in accuracy of almost 16%.

To validate the reusability of the model, we tried it on another dataset with 1 million users and 8 million ratings. We filtered this dataset and kept only those users who had rated at least 20 movies. When we applied the predictive model to this dataset, accuracy dropped by 4%. This gave us evidence that the model can be applied to other similar movie datasets.
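The filtering step is simple to express. Here is a minimal sketch, assuming the ratings arrive as `(user_id, movie_id, rating)` tuples. That layout and the `filter_active_users` helper are assumptions for illustration, not the format of the dataset we used.

```python
from collections import Counter

def filter_active_users(ratings, min_ratings=20):
    """Keep only the ratings from users who rated at least `min_ratings` movies."""
    counts = Counter(user for user, _, _ in ratings)
    return [r for r in ratings if counts[r[0]] >= min_ratings]

# Made-up data: u1 rated 20 movies, u2 rated only 2.
ratings = [("u1", f"m{i}", 4) for i in range(20)]
ratings += [("u2", "m0", 5), ("u2", "m1", 3)]

kept = filter_active_users(ratings)
kept_users = {user for user, _, _ in kept}  # only u1 remains
```

Dropping sparse users this way gives each remaining viewer enough history for their genre profile to be meaningful.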

Future Work:

As a next step, we would like to make use of movie posters, actors, directors and more for demographic predictions. We also plan to create models to improve recommendations, target advertising and make churn predictions. However, the mixed household challenge is an important issue to address. For this, we plan to leverage personal devices and identify a household profile rather than individuals. We would also like to apply these models to more datasets to test their predictions and improve the model. If you have data, let’s talk – share your insights in the comments section below.