Tuesday, December 21, 2010

MapReduce

"Easy distributed computing"

MapReduce is a framework introduced by Google for processing larges amounts of data.
The framework uses a simple idea derived from the commonly known map and reduce functions used in functional programming (ex: LISP). It divides the main problem into smaller sub-problems and distribute these to a cluster of computers. It then combines the answers to these sub-problems to obtain a final answer.

MapReduce facilitates the process of distributed computing making possible that users with no knowledge on the subject create their own distributed applications. The framework hides all the details of parallelization, data distribution load balancing and fault tolerance and the user basically has only to specify the Map and the Reduce functions.

In the process, the input is divided into small independent chunks. The map function receives a piece of the input, processes it, and passes the input in the format key/value pair as answer. These key/values are grouped in a certain way and given as input to the reduce function. This in its turn merges the values, giving the final answer.

Each map and each reduce may be processed by a different node (Computer) in the cluster. The quantity of nodes in the cluster may be as big as the availability of computers in your network. The framework is responsible for dividing the input and feeding the map function. Afterwards, it collects  map's outputs, group and send them to the reduce function. After the work of reduce if done, the framework gather the answers in a final output.

The following picture shows the MapReduce flow.


Example of Map and Reduce functions:

 This is a canonical example of the Map and the Reduce functions. It is an application to count the number of occurrence of words in a large collection of documents.

void map(String name, String document):
   // name: document name
   // document: document contents
   for each word w in document:
     EmitIntermediate(w, 1);


void reduce(String word, Iterator partialCounts):
   // word: a word
   // partialCounts: a list of aggregated partial counts
   int result = 0;
   for each pc in partialCounts:
     result += ParseInt(pc);
   Emit(result);

In this example, each map called receives a document and outputs the occurrence "1" to each word.
 The reduce function is going to  get a list with all the occurrences of a certain word and count it. Then as output it will give the total number of occurrences of that word.


Useful Links:


http://code.google.com/intl/pt-BR/edu/submissions/mapreduce-minilecture/listing.html
http://labs.google.com/papers/mapreduce.html
http://code.google.com/intl/pt-BR/edu/parallel/mapreduce-tutorial.html
http://code.google.com/intl/fr/edu/submissions/mapreduce/listing.html
http://www.mapreduce.org/
http://en.wikipedia.org/wiki/MapReduce

Monday, December 6, 2010

Datasets

I have been talking about recommender systems and data mining algorithms and a clear drawback in this area of research is the scarcity of datasets to work with. So here follows a list of open datasets available in the internet to be used as test data. The links below contain different types of data varying from implicit users web activities to explicit ratings that users have given to items. Note that I have simply gathered this data; I am just providing it here to facilitate the access.


This is a very known datasets provided by MovieLens. It is a set of explicit users ratings on items. It also contains information about the users and the items.
It provides 3 files with the .dat format.

Dataset with implicit and explicit user ratings on books.
It offers demographic information about the user as well. The files provided are mysql.

Various types of data provided by yahoo.

 Explicit ratings from a online joke recommender system. The file is in the .xls format.

Explicit users ratings from a dating agency.

Here we have web data from 3 sources.
• Microsoft: This dataset records which areas of www.microsoft.com each user visited in a one-week timeframe in Feburary 1998.
• Msnbc.com: Page visits of users who visited msnbc.com on September 28, 1999.
• Syskill and Webert: This database contains the HTML source of web pages plus the ratings of a single user on these web pages.

This data presents a real query log data from AOL. Ut is an implicit type of data.

 Here we have 800,000 search queries from end user internet search activities.

This set provides data records from a restaurant recommender system.

An implicit dataset with a day's worth of all HTTP requests to the EPA WWW server.

Here we are provided with an implicit dataset with two month's worth of all HTTP requests to the NASA Kennedy Space Center WWW server.