Posts

Showing posts from December, 2010

MapReduce

"Easy distributed computing" MapReduce is a framework introduced by Google for processing large amounts of data. The framework builds on a simple idea derived from the well-known map and reduce functions of functional programming (e.g., LISP). It divides the main problem into smaller sub-problems and distributes them across a cluster of computers. It then combines the answers to these sub-problems to obtain a final answer. MapReduce simplifies distributed computing, making it possible for users with no knowledge of the subject to create their own distributed applications. The framework hides the details of parallelization, data distribution, load balancing, and fault tolerance, so the user basically only has to specify the Map and Reduce functions. In the process, the input is divided into small independent chunks. The map function receives a piece of the input, processes it, and emits its result as key/value pairs. These k
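To make the map and reduce roles concrete, here is a minimal single-process sketch in Python. It is not Google's implementation and involves no cluster: the word-count task, the function names (`map_fn`, `shuffle`, `reduce_fn`, `map_reduce`), and the sample chunks are all assumptions chosen for illustration. It only mimics the data flow: independent chunks go through a map function that emits key/value pairs, the pairs are grouped by key, and a reduce function combines each group.

```python
from collections import defaultdict
from itertools import chain


def map_fn(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key (a real framework does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce step: combine all values seen for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    """Run map on each chunk independently, then group and reduce the results."""
    mapped = chain.from_iterable(map_fn(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical input, already split into small independent chunks.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(chunks)
print(counts)
```

Because each call to `map_fn` depends only on its own chunk, a real framework is free to run those calls on different machines; in this sketch they simply run one after another.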

Datasets

I have been talking about recommender systems and data mining algorithms, and a clear drawback in this area of research is the scarcity of datasets to work with. So here follows a list of open datasets available on the internet that can be used as test data. The links below contain different types of data, ranging from users' implicit web activity to explicit ratings that users have given to items. Note that I have simply gathered this data; I am just providing it here to make access easier. http://grouplens.org/datasets/movielens/ This is a well-known dataset provided by MovieLens. It is a set of explicit user ratings on items. It also contains information about the users and the items. It provides 3 files in the .dat format. http://www.informatik.uni-freiburg.de/~cziegler/BX/ Dataset with implicit and explicit user ratings on books. It offers demographic information about the users as well. The files are provided in MySQL format. http://webscope.sandbox.yahoo.com/ Vario