
Content-Based Movie Recommendations from Categorization
Smart Internet Technology Research Group

Aims

- Produce an online movie recommendation system based on learning for text categorization.

- Evaluate the performance of text categorization techniques over the domain of movie synopses.

- Compare the results with collaborative filtering and feature-based movie recommendations.

Introduction

- Recommendation systems are a novel sales tool that helps both retailers and consumers achieve their goals. They can help consumers find items they are looking for, or suggest items they are likely to be interested in. Retailers benefit because the recommendations generated by the system increase product exposure.

- The domain chosen for this recommendation system is movies. The system uses movie synopses and categorizes movies according to a user’s likes and dislikes. The movie synopses are extracted from the “Internet Movie Database” (IMDB) at http://www.imdb.com. They are user-contributed, and their style is informal and fairly verbose in comparison with OHSUMED and the various Reuters newswire subsets, which are popular corpora in text categorization research. This research project applies techniques used in text categorization to a structurally different domain and evaluates their performance there. The nature of the domain means that we should expect different results and more tolerance to aggressive feature reduction.

- The implementation will focus on heavy feature reduction to reduce the dimensionality of the text categorization task. This is motivated by the online aspect of this project, where user waiting times for a web request should be kept to a minimum.

- This work is an extension of R. J. Mooney’s LIBRA system, which recommends books from text and was developed at the University of Texas, USA. The Intelligent Movie Recommender experiments with numerous text categorization parameters and differs in the domain in which it categorizes the synopses.

- It is also an extension of work by a group of third-year Computer Science students at the University of Sydney, Australia. The group, “eliteAI”, uses the same source as the Intelligent Movie Recommender, and their work may be used as a basis of comparison with other recommendation techniques.

Data Representations

- The movie synopses are represented using three different feature types: a bag of words, a bag of nouns, and noun phrases.

- The bag of words representation treats each example synopsis as an unordered vector of all the words it contains. Words are excluded if they appear on a stop-word list of common words that are considered non-informative. Suffixes are removed from the remaining words to reveal their stems.

- The bag of nouns representation treats each example synopsis as an unordered vector of all the nouns it contains. Nouns are excluded if found on a stop-word list, but are not stemmed as words are in the bag of words representation. There are two motivations for this approach:

- Nouns are considered to be more valuable than other parts of speech because they indicate what the document is about. In other domains, other parts of speech may be just as important. Verbs like “sell” and “buy” would be deemed important in a financial context, but the domain of movies is more focused on things, people and events. This assumption will be evaluated by comparing this representation with the bag of words representation, which includes all other parts of speech.

- Nouns are only a fraction of the words in a sentence. The feature space in a bag of nouns representation would be only a fraction of the feature space of the bag of words representation.

- The noun phrase representation treats each example synopsis as an unordered vector of noun phrases. The motivation behind this approach is that the noun phrase gives context to the nouns in a sentence. The feature space of the noun phrase representation is smaller than that of the bag of words representation but appears to be much sparser than both the bag of nouns and bag of words representations. This is because any difference between two similar noun phrases yields two completely different features. For example, “exceedingly rich king” and “exceedingly rich and eccentric king” would be treated as two distinct features in the noun phrase representation, whereas a bag of nouns representation would map them both to the same feature: “king”.

- Figure 1 shows the comparative sizes of each data representation with respect to the number of distinct terms.
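
The three representations can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the tiny stop-word list, the crude suffix stripper (standing in for a real stemmer), and the hand-supplied part-of-speech tags and noun phrases are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}

# Crude suffix stripping, standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """All words, minus stop words, reduced to (crude) stems."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return Counter(crude_stem(t) for t in tokens if t and t not in STOP_WORDS)


def bag_of_nouns(tagged_tokens):
    """Nouns only, minus stop words, not stemmed.

    tagged_tokens holds (word, tag) pairs; here they are tagged by hand,
    whereas a real system would obtain them from a part-of-speech tagger.
    """
    return Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag == "NOUN" and word.lower() not in STOP_WORDS
    )


def bag_of_noun_phrases(phrases):
    """Each whole noun phrase is one feature, so near-duplicates stay distinct."""
    return Counter(p.lower() for p in phrases)


phrases = ["exceedingly rich king", "exceedingly rich and eccentric king"]
tagged = [
    ("exceedingly", "ADV"), ("rich", "ADJ"), ("king", "NOUN"),
    ("exceedingly", "ADV"), ("rich", "ADJ"), ("and", "CONJ"),
    ("eccentric", "ADJ"), ("king", "NOUN"),
]
print(bag_of_noun_phrases(phrases))  # two distinct features, each seen once
print(bag_of_nouns(tagged))          # a single feature, "king", seen twice
```

The last two lines reproduce the “king” example above: the noun phrase representation yields two features that never co-occur, which is why it is sparser, while the bag of nouns collapses both phrases onto one feature.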

Feature Selection

- The user profiles are subject to feature selection to reduce the dimensionality of the problem. The system implements three different feature selection techniques used in text categorization: Information Gain, Document Frequency Thresholding and Mutual Information.

- Document Frequency Thresholding and Information Gain favour common terms, and Figure 2 shows that they significantly reduce the index size compared to a representation without feature selection.

- Mutual Information does not favour common terms, and Figure 2 shows that its index size is similar to that of the representation without feature selection.
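
As a rough illustration of the three scores, the sketch below computes them over a toy set of rated synopses. It assumes the formulations commonly used in the text categorization literature (document frequency as a document count, information gain as the reduction in class entropy, and mutual information in its pointwise form); the project does not state which exact variants it uses, and the data and function names here are hypothetical.

```python
import math

# Toy profile: each rated synopsis is (set of terms, rating label).
docs = [
    ({"king", "war"}, "like"),
    ({"king", "love"}, "like"),
    ({"love", "war"}, "dislike"),
    ({"zombie", "war"}, "dislike"),
]


def document_frequency(term, docs):
    """Number of documents that contain the term."""
    return sum(1 for terms, _ in docs if term in terms)


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def information_gain(term, docs):
    """Class entropy minus the expected entropy after splitting on the term."""
    labels = sorted({label for _, label in docs})

    def class_counts(subset):
        return [sum(1 for _, l in subset if l == c) for c in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    n = len(docs)
    return (
        entropy(class_counts(docs))
        - len(with_term) / n * entropy(class_counts(with_term))
        - len(without_term) / n * entropy(class_counts(without_term))
    )


def pointwise_mutual_information(term, label, docs):
    """log2 of P(term, label) / (P(term) * P(label)), estimated from counts."""
    n = len(docs)
    both = sum(1 for terms, l in docs if term in terms and l == label)
    with_term = sum(1 for terms, _ in docs if term in terms)
    with_label = sum(1 for _, l in docs if l == label)
    if both == 0:
        return float("-inf")
    return math.log2(both * n / (with_term * with_label))


for term in ("king", "war", "zombie"):
    print(term, document_frequency(term, docs), information_gain(term, docs))
```

In this toy data “zombie” occurs in a single document yet receives the same pointwise mutual information with “dislike” as “king” does with “like”, which illustrates why a mutual information criterion does not push rare terms out of the index the way a document frequency threshold does.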

Methods for Classification

- The recommendation system classifies movie synopses via various machine learning algorithms. Classifiers include a Naïve Bayes’ Classifier, a k-Nearest Neighbour Classifier and Decision Trees.

- Several machine learners are employed in order to examine how different learners perform over the movie synopsis domain.

- Decision Trees are included in the system on account of the way that they structure the decision process. An extension for this project would be to translate this structure into explanations for the user.
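
To make the classification step concrete, here is a minimal multinomial Naïve Bayes classifier over bags of terms, written in plain Python. It is only a sketch: the system itself relies on the WEKA library for its classifiers, and the add-one smoothing, the class name and the toy training data below are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs):
        """docs is a list of (list of terms, label) pairs."""
        self.class_docs = Counter()              # documents seen per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        self.vocab = set()
        for terms, label in docs:
            self.class_docs[label] += 1
            self.term_counts[label].update(terms)
            self.vocab.update(terms)
        return self

    def predict(self, terms):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_terms = sum(self.term_counts[label].values())
            for term in terms:
                if term not in self.vocab:
                    continue  # terms never seen in training are ignored
                count = self.term_counts[label][term]
                score += math.log((count + 1) / (total_terms + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


training = [
    (["king", "war", "castle"], "like"),
    (["king", "quest"], "like"),
    (["zombie", "gore"], "dislike"),
    (["zombie", "war"], "dislike"),
]
model = NaiveBayesSketch().fit(training)
print(model.predict(["king", "castle"]))
print(model.predict(["zombie"]))
```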

System Description

Building a Profile:
- There is a considerable amount of pre-processing each time a user makes a new rating. Feature selection is performed again over the user profile to cater for new terms that are not yet in it. The output of each feature selection technique for each data representation is stored in the “User Profile Storage”. Figure 3 illustrates the process of building and storing a user profile.
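
The rebuild step might be organised along the following lines. This is a sketch under stated assumptions: the storage is modelled as a plain dictionary keyed by (user, representation, technique), only a document-frequency selector is registered, and all names are hypothetical rather than taken from the system.

```python
from collections import Counter


def top_k_by_document_frequency(rated_docs, k):
    """Keep the k terms that occur in the most rated synopses."""
    df = Counter()
    for terms, _rating in rated_docs:
        df.update(set(terms))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical registry; other techniques would be added alongside this one.
SELECTORS = {"document_frequency": top_k_by_document_frequency}


def rebuild_profile(store, user, rated_by_representation, k=2):
    """Re-run every selector over every representation after a new rating."""
    for representation, rated_docs in rated_by_representation.items():
        for technique, select in SELECTORS.items():
            store[(user, representation, technique)] = select(rated_docs, k)
    return store


profile_store = {}
rated = {
    "bag_of_words": [
        (["king", "war"], "like"),
        (["king", "love"], "like"),
        (["war", "king"], "dislike"),
    ],
}
rebuild_profile(profile_store, "alice", rated)
print(profile_store)
```

Because every technique is re-run for every representation on each new rating, the cost of this step grows with the number of combinations, which is consistent with the pre-processing overhead noted above.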

Machine Learning:
- WEKA, a machine learning algorithm library from the University of Waikato, New Zealand, powers all the classifiers used by the system. The WEKA library learns user profiles and stores the trained classifier states for later classification of unseen examples for recommendation.

Recommending a Movie:
- When a user requests a recommendation, the system loads the trained classifier for any one of the three Machine Learning Algorithms with varying parameters. A recommendation is then generated by the selected trained classifier by classifying an unseen example. Figure 4 illustrates this recommendation process.
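
The request path can be sketched as "load a stored classifier, classify unseen synopses, return those predicted as liked". The code below is illustrative only: the keyword-matching classifier is a hypothetical stand-in for a trained learner, JSON stands in for whatever serialization the system uses for its stored classifier states, and the storage keys and movie titles are invented.

```python
import json


class KeywordClassifier:
    """Hypothetical stand-in for a trained learner: likes any known term."""

    def __init__(self, liked_terms):
        self.liked_terms = sorted(set(liked_terms))

    def classify(self, terms):
        return "like" if set(self.liked_terms) & set(terms) else "dislike"

    def to_json(self):
        """Serialize the trained state so it can be stored between requests."""
        return json.dumps({"liked_terms": self.liked_terms})

    @classmethod
    def from_json(cls, payload):
        """Restore a classifier from its stored state."""
        return cls(json.loads(payload)["liked_terms"])


def recommend(storage, user, algorithm, unseen_movies):
    """Load the stored classifier and return titles it classifies as 'like'."""
    classifier = KeywordClassifier.from_json(storage[(user, algorithm)])
    return [
        title
        for title, terms in unseen_movies.items()
        if classifier.classify(terms) == "like"
    ]


# Training happens when ratings change; only loading happens per request.
storage = {("alice", "keyword"): KeywordClassifier(["king", "castle"]).to_json()}
unseen = {"Movie A": ["king", "quest"], "Movie B": ["zombie"]}
print(recommend(storage, "alice", "keyword", unseen))
```

Keeping training out of the request path in this way matches the goal of minimising user waiting times for a web request.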

Online Interface:
- The user interface is a modified version of the eliteAI online movie recommender interface, extended to include classification using the text categorization techniques described above. This interface allows users to obtain recommendations from both systems.


Contact

Harry Mak
Dr Irena Koprinska

Dr Josiah Poon

 