Tuesday, November 23, 2010

Slope One

Slope One is a simple and efficient type of recommender system. Introduced by Daniel Lemire and Anna Maclachlan in 2005, it involves a simpler idea than the majority of other collaborative filtering implementations. While these usually calculate the similarity between vectors of items using the cosine or the Pearson methods, the Slope One approach recommends items to users based on the average difference in preferences of items. 

The main idea of the algorithm is to create a linear relation between items preferences such as the relation F(x) = x + b. The name "Slope One" cames from the fact that here the "x" is multiplied by "1".

It basically calculates the difference between the ratings of items for each user (for every item the user has rated). Then, it creates and average difference (diff) for every pair of items. To make a prediction of the Item A for an User 1 for example, it would get the ratings that User 1 has given to other items and add to it the difference (diff) between each item. With this we could obtain an average.

Being R1B the rating that user 1 has given to item B, DiffAB the difference between ratings of item A and B, and supposing we have 10 items:


Prediction(Item A) for User 1  = (R1B + DiffAB) + (R1C + DiffAC) + ... (R1J + DiffAJ)
                                                                                10

Example

Suppose that we have a matrix of ratings from users to items. We have an user A that has given an item I the rating 1 and item J the rating 1.5. We also have an user B that has given item I the rating 2.
Based on that, how would slope one calculate the rating from user B to item J?



The algorithm would say that there is a linear relation between the differences of the items. So if the  preference difference of items I and J by user A is 0.5, we would apply this difference to the rating user B has given to I and obtain a rating to J.
Basically , we have

Pref(J) by user B = Pref (I) + Diff I J



Another example:

The prediction from user A to item 3 would be:

  Diff (item 3 x item 1) = 3 – 4 => - 1
  Diff (item 3 x item 2) = 3 – 3 => 0

  3 is the rating given by user A to item 1 and 5 is the rating given to item 2. So we add his given ratings to the diffs.


P(item 3) = [(- 1 + 3) + ( 0 + 5) ] / 2 =>  3.5

Pseudo-Code

Below I present a pseudo-code version of the algorithm. It can be divided in two parts:

   The preprocessing phase, in which is calculated the
   difference between all item-item preference values

1. for every item i
2.  for every other item j
3.   for every user u expressing preference for both i and j
4.     add the difference in u’s preference for i and j to an average

   The prediction part

1. for every item i the user u expresses no preference for
2.  for every item j that user u expresses a preference for
3.   find the average preference difference between j and i
4.   add this diff to u’s preference value for j
5.   add this to a running average
6. return the top items, ranked by these averages


Complexity

In terms of execution, the most expensive phase is the preprocessing one, which can be precomputed. The algorithm is very attractive because its online portion, the prediction phase, is fast.
Its running time does not depend on the number of users, it depends mostly upon the average preference difference between every pair of items. Suppose we have m users and n items, computing the average differences could take up to m n2 time steps. The storing of the diff matrix could also be expensive. It could take up to n(n-1)/2 units of storage.

Useful links:


Thursday, November 4, 2010

Apache Mahout



“Scalable machine learning library”

Mahout is a solid Java framework in the Artificial Intelligence area. It is a machine learning project by the Apache Software Foundation that tries to build intelligent algorithms that learn from some data input.

What is special about Mahout is that  it is a scalable library, prepared to deal with huge datasets. Its algorithms are built on top of the Apache Hadoop project and, so, they work with distributed computing.

Mahout offers algorithms in three major  areas: Clustering, Categorization and Recommender Systems. This lats part was incoporated in April 4th 2008, from the previous Taste Recommender System project.

Mahout currently implements  a collaborative filtering engine that supports the user-based, item-based and Slope-one recommender systems. Other algorithms available in the package are  the k-means, fuzzy k-Means clustering, Canopy, Dirichlet and Mean-Shift. They also have The Naive Bayes, Complementary Naive Bayes and Random forest decision tree based classifier.

The project present a commercial friendly license (meaning that modifications in the code can be kept proprietary)  and a vast community of users and developers, so for that, it is highly recommended!

a. Taste
Taste is the Recommender System part of Mahout and it provides a very consistent and flexible collaborative filtering engine. It supports the user-based, item-based and Slope-one recommender systems. It can be easily modified in due to its well-structured modules abstractions. The package defines the following interfaces:

  • DataModel
  • UserSimilarity and ItemSimilarity
  • UserNeighborhood
  • Recommender


With these interfaces, its possible to adapt the framework to read different types of data, personalize your recommendation or even create new recommendation methods. Below is presented a figure with Taste's architecture.
The User Similarity and Item Similarity abstractions are here represented by the box named “Similarity”. These interfaces are responsible for calculating the similarity between a pair of users or items. Their function usually returns a value from 0 to 1 indicating the level of resemblance, being 1 the most similar possible.
Trough the DataModel interface is made the access to the data set. It is possible to retrieve and store the data from databases or from filesystems (MySQLJDBCDataModel and FileDataModel respectively). The functions developed in this interface are used by the Similarity abstraction to help
computing the similarity.
The main interface in Taste is Recommender. It is responsible for actually making the recommendations to the user by comparing items or by determining users with similar taste (item-based and user-based techniques). The Recommender access the similarity interface and uses its functions to compare a pair of users or items. It then collect the highest similarity values to offer as recommendations.

The UserNeighborhood is an assistant interface to help defining the neighborhood int the User-Based recommendation technique.
It is known that for greater data sets the item-based technique provides better results. For that, many companies choose to use this approach, such as Amazon. With the Mahout framework its not different, the item-based method generally runs faster and provides more accurate recommendation.

b. Installation

Here follow a step-by-step guide to install and test the Mahout recommender system.
Firstly, is necessary to have the project manager Maven. It is very easy to use it. It runs simply by calling the “mvn install” command. This command will compile the code and download missing packages. Maven uses a “pom.xml” file for configuration, but the Mahout project already comes with this file.
To test Taste it possible to get a data set from MovieLens. It is package with 3 files in the .dat format, presenting users, items, ratings and some information on users and items.


Pre-installation

1. Make sure you have at least the Java JDK 1.6

$ javac -version
javac 1.6.0_26

2. Make sure you have the project manager Maven installed in your computer.

$ mvn -version
Apache Maven 2.2.1 (rdebian-1)

3. Download a Hadoop version, from http://www.apache.org/dyn/closer.cgi/hadoop/common/. I downloaded: archive.apache.org/dist/hadoop/core/hadoop-0.20.204.0/ , which is the version required by Mahout 0.7 on pom.xml (check hadoop.version property).

Mahout Installation

1. Download the Mahout package:
https://cwiki.apache.org/confluence/display/MAHOUT/Downloads

I downloaded http://ftp.unicamp.br/pub/apache/mahout/0.7/mahout-distribution-0.7-src.tar.gz version (for development purposes it is even better to download with svn co http://svn.apache.org/repos/asf/mahout/trunk).

2. Unpack the Mahout downloaded package

$ tar -xvzf mahout-distribution-0.7-src.tar.gz

3. In a terminal, change to the mahout directory and compile the code using Maven :

$ cd mahout-distribution-0.7/
$ mvn install

With this, you will have compiled Mahout's code, and run the UnitTests that comes with it, to make sure everything is ok with the component.

If the build was sucessfull you should see something like:

[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 54 minutes 28 seconds
[INFO] Finished at: Tue Jul 09 11:17:08 BRT 2013
[INFO] Final Memory: 70M/375M
[INFO] ------------------------------------------------------------------------



 Executing Taste Recommender

1. Get your data on the following format:

userid, itemid, rating

For example, copy the following data and name it as mydata.dat :

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
   

2. Make sure you set your JAVA_HOME, HADOOP_HOME, for example:

$ export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk/
$ export HADOOP_HOME=/home/myuser/Downloads/hadoop-0.20.204.0/

$ export MAHOUT_HOME=/path/to/mahout-distribution-0.7

And put them on your PATH:


$ export PATH=$HADOOP_HOME/bin:$PATH
$ export PATH=$MAHOUT_HOME/bin:$PATH

3. Now, run:

$ bin/mahout recommenditembased --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
 

The usersFile is where you should put for which users you want to o the recommendation for. You can change numRecommendations to the number of recommendations you desire.

Also, on similarityClassname, you can choose anyone you like from the above list:

  • SIMILARITY_COOCCURRENCE
  • SIMILARITY_LOGLIKELIHOOD
  • SIMILARITY_TANIMOTO_COEFFICIENT
  • SIMILARITY_CITY_BLOCK
  • SIMILARITY_COSINE
  • SIMILARITY_PEARSON_CORRELATION       
  • SIMILARITY_EUCLIDEAN_DISTANCE

Another Example

You can also test it with some real dataset, for example, the one from MovieLens.
1. Download a copy of the “1 million” data set from MovieLens:

2. Unpack the MovieLens data ( you will find 3 files: movies.dat, ratings.dat and

users.dat)
3. Edit the ratings.dat file, so that it is in the 
userid,itemid,rating format, and not on  userid::itemid::rating format.

3. Copy the movies.dat and ratings.dat to your Mahout directory.
4. Run:

$ bin/mahout recommenditembased --input ratings.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
  
c. Your Own Examples

To create your own Recommender System example, it is necessary to construct a Java Application defining which approach it's going to be used.
If you chose, for example, the Slope-One technique, the code may be:
 DataModel model = new FileDataModel(new File("data.txt"));
 Recommender recommender = new
 SlopeOneRecommender(model);
 Recommender cachingRecommender = new
 CachingRecommender(recommender);

This code gets the data input called “data.txt” and passes it to the Recommender
interface. It then provides a recommendation to the user based on the Slope-one technique.
It is possible to construct similar examples, using other approaches such as a combination of a User-based method with the Pearson Correlation Similarity, or an Item-based approach with the Log Likelihood Similarity measurement.


d. Modifying

To modify the Mahout framework, it is advisable to focus in the interface that you are willing to change. The implementations of the interfaces resides in a folder inside the Mahout package called “impl” . So it would be necessary to check which interface is going to be modified and then take advantage of mahout's structure.
If it is desired, for example, to add a new type of recommendation, it would be simply necessary to add another Java file in the Recommender interface implementation, and call it in your final Java application.

 e. Links

http://mahout.apache.org/
http://www.ibm.com/developerworks/java/library/j-mahout/
http://www.manning.com/owen/
http://blog.jteam.nl/2010/04/15/mahout-taste-part-two-getting-started/
VIDEO on how to use Mahout on Eclipse: https://www.youtube.com/watch?v=yD40rVKUwPI