In Part I of this blog series, we looked at statistics of movie consumption and rating data to get an idea of how they relate to age and gender. For Part II, we want to go deeper and not only understand the data but also use it for prediction.
Based on our observations of movie keywords, we thought they might provide additional information for predicting demographics. Our dataset contained more than 20,000 keywords across all movies, so we began by tracking their frequency and found that more than 75% of keywords were associated with fewer than five movies. Restricting our analysis to the 4,000 most frequent keywords, we identified the most prominent ones; they are displayed in the graphic below.
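The frequency count described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used for this analysis: the toy `movie_keywords` data and the helper names `keyword_movie_counts` and `share_below` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: movie id -> keywords attached to that movie.
movie_keywords = {
    "m1": ["high-school", "friendship", "prom"],
    "m2": ["high-school", "revenge"],
    "m3": ["friendship", "road-trip"],
    "m4": ["high-school", "friendship"],
    "m5": ["heist"],
}


def keyword_movie_counts(movie_keywords):
    """Count how many distinct movies each keyword is attached to."""
    counts = Counter()
    for keywords in movie_keywords.values():
        counts.update(set(keywords))  # set(): count a keyword once per movie
    return counts


def share_below(counts, threshold):
    """Fraction of keywords attached to fewer than `threshold` movies."""
    rare = sum(1 for n in counts.values() if n < threshold)
    return rare / len(counts)


counts = keyword_movie_counts(movie_keywords)
# Keep only the N most frequent keywords (N = 2 here, 4,000 in the article).
top_keywords = [kw for kw, _ in counts.most_common(2)]
# Share of keywords that appear in fewer than 2 movies (5 in the article).
rare_share = share_below(counts, threshold=2)
```

On real data, the same two calls would use `most_common(4000)` and `threshold=5` to mirror the numbers quoted above.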
Next, we analyzed the most popular keywords for each gender. Some keywords appeared only in movies watched by males and others only in movies watched by females; still others appeared mostly in movies watched by one gender and rarely in those watched by the other. This led us to believe that we could use certain keywords to predict the gender of a user.
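One way to surface such gender-specific keywords is to compute, for each keyword, the share of its views that come from male users. The sketch below assumes a hypothetical list of `(gender, keywords)` pairs, one per watched movie; the data and the function name `male_share_by_keyword` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (user gender, keywords of one watched movie).
views = [
    ("M", ["heist", "car-chase"]),
    ("M", ["heist", "friendship"]),
    ("F", ["friendship", "wedding"]),
    ("F", ["friendship", "heist"]),
    ("M", ["car-chase"]),
    ("F", ["wedding"]),
]


def male_share_by_keyword(views):
    """For each keyword, the fraction of its views that came from male users."""
    tally = defaultdict(lambda: {"M": 0, "F": 0})
    for gender, keywords in views:
        for kw in set(keywords):
            tally[kw][gender] += 1
    return {kw: c["M"] / (c["M"] + c["F"]) for kw, c in tally.items()}


shares = male_share_by_keyword(views)
# Keywords seen only by one gender sit at the extremes of the scale.
male_only = sorted(kw for kw, s in shares.items() if s == 1.0)
female_only = sorted(kw for kw, s in shares.items() if s == 0.0)
```

A share near 0.5 means the keyword says little about gender. If one gender has many more views in the data than the other, the raw shares would need to be normalized by group size before comparing them.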
We used a supervised learning model for classification: a process in which an algorithm learns from a labeled dataset by iteratively making predictions. Incorrect predictions are corrected, and the learning process stops when the algorithm achieves an acceptable level of performance.
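To make the predict-and-correct loop concrete, here is a minimal sketch using a perceptron, one of the simplest supervised classifiers. The perceptron is our choice for illustration only and is not necessarily the algorithm behind the results in this post; the toy data, the function names, and the `target_accuracy` stopping rule are all assumptions.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train_perceptron(X, y, target_accuracy=1.0, max_epochs=100, lr=1.0):
    """Adjust the weights after every wrong prediction; stop once the
    training accuracy reaches `target_accuracy` or epochs run out."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    accuracy = 0.0
    for _ in range(max_epochs):
        for x, label in zip(X, y):
            error = label - predict(weights, bias, x)  # 0 when correct
            if error != 0:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        correct = sum(predict(weights, bias, x) == label for x, label in zip(X, y))
        accuracy = correct / len(X)
        if accuracy >= target_accuracy:
            break
    return weights, bias, accuracy


# Hypothetical toy data: each row marks which of three keywords a user has seen;
# each label is the class to predict (1 or 0).
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
weights, bias, accuracy = train_perceptron(X, y)
```

For brevity the sketch measures accuracy on the training data itself. A realistic evaluation would hold out a separate test set, since training accuracy overstates how well a model predicts unseen users.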
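For this approach, 4,000 keywords were too many features to predict demographics, so we reduced them to the 100 most distinguishing features using Principal Component Analysis (PCA). Strictly speaking, PCA does not pick individual keywords; it builds new features (components) that are weighted combinations of the original keywords. The algorithm is defined so that the first principal component has the highest possible variance and each subsequent component has the highest remaining variance while being uncorrelated with the previous ones, so keeping only the leading components retains the most informative directions in the data.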
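The reduction step can be sketched with NumPy. The following is a minimal PCA via singular value decomposition under stated assumptions: the random user-by-keyword matrix is made up, the function name `pca_reduce` is hypothetical, and the sizes (20 keywords reduced to 5 components) stand in for the 4,000 and 100 mentioned above.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top `n_components` principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance of the data along each kept direction.
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance


rng = np.random.default_rng(0)
# Hypothetical user-by-keyword count matrix: 50 users, 20 keywords.
X = rng.poisson(lam=1.0, size=(50, 20)).astype(float)
X_reduced, components, explained_variance = pca_reduce(X, n_components=5)
```

Each row of `components` holds one weight per original keyword, which shows that a component mixes many keywords rather than selecting a single one. The values in `explained_variance` are in non-increasing order, with the first component carrying the most variance.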
Using these features, we were able to predict the gender of a user with 53% accuracy. Applying different algorithms to the same feature set later raised this to 62%. Although we expected keywords to be the most important feature in classification, the results were a little surprising, since we had hoped for a more accurate model. Perhaps some keywords, like “sister-sister-relationship,” were too specific and therefore weren’t very effective in predictions. We also speculate that combining the keywords from all the movies a user watched created too much noise and reduced the accuracy of the model.
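To illustrate what combining keywords across a user's movies looks like, here is a minimal sketch that turns a viewing history into one keyword-count vector per user. The data, the fixed `vocabulary`, and the function name `user_keyword_vector` are hypothetical; the actual feature construction may have differed.

```python
from collections import Counter

# Hypothetical toy data.
movie_keywords = {
    "m1": ["high-school", "prom"],
    "m2": ["heist", "car-chase"],
    "m3": ["high-school", "friendship"],
}
watched = {"u1": ["m1", "m3"], "u2": ["m2"], "u3": ["m1", "m2", "m3"]}
# Fixed keyword order so every user vector has the same layout.
vocabulary = ["car-chase", "friendship", "heist", "high-school", "prom"]


def user_keyword_vector(movies, movie_keywords, vocabulary):
    """Sum keyword occurrences over all movies a user watched."""
    counts = Counter()
    for movie in movies:
        counts.update(movie_keywords.get(movie, []))
    return [counts[kw] for kw in vocabulary]


features = {
    user: user_keyword_vector(movies, movie_keywords, vocabulary)
    for user, movies in watched.items()
}
```

In this toy example, user `u3` has watched every movie and ends up with a non-zero count for every keyword. That kind of vector carries little distinguishing signal, which is consistent with the noise we speculate about above.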
Similarly, we analyzed the keywords associated with two age groups: users younger than 25 and users older than 25. We found that some of the keywords most prominent in the younger group relate to high school and other youth-oriented themes. More interestingly, the keywords most prominent in the older group do not point to any specific type of movie. As a next step, we attempted to classify users as young or old and achieved an accuracy of 72%.
Ultimately, we discovered that keywords give us a strong enough understanding of viewer demographics to make predictions.
As a next step, we evaluated the performance of genres in prediction models and found a significant increase in performance, as outlined in the third article of this series.