Showing posts from November, 2016

Running k-Means Clustering on Spark with Cloudera in your Machine

Here are some steps to start using Spark. You can download VirtualBox and a Cloudera Hadoop distribution and test Spark locally on your machine.

Steps:

Download the kmeans.py example, which uses MLlib, the machine learning library that ships with Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

Download VirtualBox.

Download the Cloudera CDH5 trial version.

Open VirtualBox, import the downloaded Cloudera virtual machine and run it.

Inside VirtualBox:

1 - (needs internet access) Install the Python numpy library. In a terminal, type:
$ sudo yum install numpy

2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want).

3 - Launch Cloudera Enterprise Trial by clicking its icon on the Cloudera desktop, or run this command:
$ sudo cloudera-manager --force --enterprise

4 - Open the Cloudera Manager web interface in your browser. Here are the credentials for that:
user: cloudera
password: cloud