Showing posts from 2016

Running k-Means Clustering on Spark with Cloudera in your Machine

Here are some steps to get started with Spark. You can download VirtualBox and a Cloudera Hadoop distribution and test everything locally on your machine.

Steps:

Download the kmeans.py example, which uses the MLlib library shipped with Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

Download VirtualBox.

Download the Cloudera CDH5 trial version.

Open VirtualBox, import the downloaded Cloudera virtual machine and run it.

Inside VirtualBox:

1 - (needs internet access) Install the Python numpy library. In a terminal, type:
$ sudo yum install numpy

2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want).

3 - Launch Cloudera Enterprise Trial by clicking its icon on the Cloudera desktop, or run this command:
$ sudo cloudera-manager --force --enterprise

4 - Open the Cloudera Manager web interface in your browser. Here are the credentials for that:
user: cloudera
password: cloud
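To get a feel for what a k-means run on kmeans_data.txt should return, here is a minimal pure-Python sketch of Lloyd's algorithm with k = 2. It is not the Spark MLlib code that kmeans.py calls; the function names (parse_points, squared_distance, mean, kmeans) and the choice of initial centers are assumptions made only for illustration.

```python
# Pure-Python sketch of k-means (Lloyd's algorithm). This is NOT Spark MLlib;
# it only illustrates the result to expect on the six sample points.

DATA = """\
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
"""


def parse_points(text):
    """Turn whitespace-separated rows into tuples of floats."""
    return [tuple(float(v) for v in line.split())
            for line in text.splitlines() if line.strip()]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, initial_centers, max_iterations=20):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = parse_points(DATA)
# Assumption: seed with the first and last points; MLlib has its own
# initialization strategy.
centers = kmeans(points, [points[0], points[-1]])
for c in centers:
    print(" ".join(f"{v:.1f}" for v in c))
```

With this seeding the two centers settle near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), one for each obvious group in the file, which is a handy sanity check for the output of the Spark job.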

Error when using smooth.spline

When trying to interpolate a series of data points, the cubic spline is a great technique to use. I chose the smooth.spline function from the R stats package.

> smooth.spline(data$x, data$y)

However, while running smooth.spline on a collection of datasets of different sizes, I got the following error:

Error in smooth.spline(data$x, data$y) :
  'tol' must be strictly positive and finite

After digging a little, I discovered that the problem was that some datasets were very small and smooth.spline could not compute anything from them. So make sure your dataset is big enough before applying smooth.spline to it.

> if (length(data$x) > 30) { smooth.spline(data$x, data$y) }

UPDATE: A more general solution is:

> if (IQR(data$x) > 0) { smooth.spline(data$x, data$y) }

This works because the default tolerance in smooth.spline is tol = 1e-6 * IQR(x), so the error is raised whenever the interquartile range of x is zero, regardless of how many points the dataset has.
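To illustrate why the length check and the IQR check can disagree, here is a small Python sketch of the same guard. It is only an illustration outside R: the helper names iqr and safe_to_smooth are hypothetical, and the assumption is that statistics.quantiles with method="inclusive" uses the same linear interpolation as R's default IQR().

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range using linear interpolation between data points.

    Assumption: method="inclusive" is meant to mirror R's default IQR().
    """
    if len(values) < 2:
        return 0.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def safe_to_smooth(x):
    """Hypothetical guard: True when a tolerance derived from IQR(x) is positive."""
    return iqr(x) > 0


# Ten distinct x values: fewer than 30 points, but a positive IQR.
spread = [float(i) for i in range(10)]

# Forty-one points, forty of them identical: long enough, but the IQR is zero.
degenerate = [5.0] * 40 + [6.0]

print(len(spread) > 30, safe_to_smooth(spread))
print(len(degenerate) > 30, safe_to_smooth(degenerate))
```

The first dataset would be skipped by the length test even though its IQR is positive, while the second passes the length test yet has an IQR of zero, which is the condition behind the 'tol' error.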