Tuesday, November 8, 2016
Running k-Means Clustering on Spark with Cloudera
While running these steps, errors might appear at some points because of initialization timing issues. I know this is annoying advice, but if that happens, just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside kmeans.py so that it points to your data, and you may also want to change where the output is written (target/org/apache/spark/PythonKMeansExample/KMeansModel).
Download the kmeans.py example, which uses the MLlib library shipped with Spark.
Create a kmeans_data.txt file that looks like this:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
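If you would rather generate the file than type it, the following is a minimal Python sketch that writes the six points above in the same space-separated layout and reads them back as a sanity check. The helper names write_points and read_points are made up for this example; only the file name kmeans_data.txt comes from this post.

```python
# Sketch: create kmeans_data.txt with the six sample points and check that it parses.
POINTS = [
    (0.0, 0.0, 0.0),
    (0.1, 0.1, 0.1),
    (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0),
    (9.1, 9.1, 9.1),
    (9.2, 9.2, 9.2),
]


def write_points(path, points):
    """Write one point per line, coordinates separated by single spaces."""
    with open(path, "w") as f:
        for p in points:
            f.write(" ".join(str(x) for x in p) + "\n")


def read_points(path):
    """Parse the file back into a list of float tuples, skipping blank lines."""
    with open(path) as f:
        return [tuple(float(x) for x in line.split()) for line in f if line.strip()]


if __name__ == "__main__":
    write_points("kmeans_data.txt", POINTS)
    print(read_points("kmeans_data.txt"))
```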
Download the Cloudera CDH5 trial version.
Open VirtualBox, import the downloaded Cloudera virtual machine, and run it.
1 - (needs internet access) Install the Python numpy library. In a terminal, type:
$ sudo yum install numpy
2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want)
3 - Launch Cloudera Enterprise Trial by clicking its icon on the Cloudera desktop, or run this command:
$ sudo cloudera-manager --force --enterprise
4 - Open the Cloudera Manager web interface in your browser. Here are the credentials for that:
5 - Start HDFS from the Cloudera Manager web interface (in your browser)
6 - Start Spark from the Cloudera Manager web interface (in your browser)
7 - Put kmeans_data.txt into HDFS (with no destination given, it lands in your HDFS home directory). Run:
$ hadoop fs -put kmeans_data.txt
8 - Run the Spark job kmeans.py locally with 2 threads:
$ spark-submit --master local[2] kmeans.py
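To get a feel for what the job computes, here is a plain-Python sketch of Lloyd's algorithm (the basic k-means iteration) on the six sample points with k = 2. It is not MLlib's implementation: MLlib chooses its initial centers differently and runs distributed, whereas this sketch seeds the centers with the first and last points, an assumption made only to keep the run deterministic.

```python
# Sketch of Lloyd's k-means iteration in plain Python (not the MLlib implementation).
POINTS = [
    (0.0, 0.0, 0.0),
    (0.1, 0.1, 0.1),
    (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0),
    (9.1, 9.1, 9.1),
    (9.2, 9.2, 9.2),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centers, max_iterations=10):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(centers)
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    # Assumption: seed with the first and last points for a deterministic run.
    for center in kmeans(POINTS, [POINTS[0], POINTS[-1]]):
        print(" ".join("%.2f" % x for x in center))
```

With this seeding, the two centers settle near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), which is the grouping you would expect from the sample data.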
9 - Get the result from HDFS and put it in your current directory:
$ hadoop fs -get KMeansModel/*
10 - The result is stored in Parquet format. Read it with parquet-tools:
$ parquet-tools cat KMeansModel/data/part-r-000..
Here is an example of the output this command should give: