Tuesday, December 27, 2016

Getting Started with Regression in R

Regression is widely used to estimate relationships between variables or to predict future values of a dataset.

If you want to know how much a variable "x" influences a variable "y", you might want to run a regression on your data. If you have a bunch of data points over time and want to know what your data will look like in the future, regression can help there too.

I will describe the steps that helped me successfully build linear and non-linear regressions in R, using polynomials and splines. I am not going to go into too much detail on each method. I just want to give an overall step-by-step guide on how to do a general regression in R, so that you guys can go further on your own.


First Steps: get to know your data


The first thing you should do is see what your data looks like. Plot the data, maybe compute some summary statistics, and try to understand what type of relationship exists between the variables.

The relationship between your data points might be linear (a line) or non-linear (a curve). The dependent variable might depend on only one variable or on several.

Suppose you have the following dummy dataset:


x       y
1.1     1
2.1     4
3.1     10
4.1     16
5.1     25
6.1     36
7.6     52
8.1     64
9.1     81
10.1    100
11.1    121
13.1    138
13.1    169
14.1    196
15.1    225
16.1    256
17.1    289
18.1    324
19.1    361
18.1    400
21.1    441
22.1    484
23.1    529
24.6    574
25.1    625
26.1    676
27.1    729
29      789
29.1    841
30.1    900

Here we have one variable y that varies according to another variable x.
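To follow along in R, the table above can be entered as two numeric vectors (named to match the calls used throughout this post):

```r
# The example dataset, entered as two numeric vectors
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)
```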

Plotting this data gives us the following graph:

> plot(x, y, col="blue", main="Example Graph")
> grid(nx = 12, ny = 12, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)


Visually, we can say that this data seems to follow a non-linear pattern. More than that, the relationship between y and x looks like a second-degree polynomial.

Models


The natural first step is to check whether a second-degree polynomial matches our data. To create a model, we can use the lm() function, passing as a parameter the formula you think might suit your data. Here we are essentially "guessing" which model best fits the data.

my_model <- lm(y ~ poly(x, 2))

In the example above we are creating a model where y is a second-degree polynomial of x. Note that poly() uses orthogonal polynomials by default; if you want the raw terms x and x^2, use poly(x, 2, raw = TRUE).

Of course, you might want to test other alternatives:

my_model_linear <- lm(y ~ poly(x, 1))

Or (why not?)

my_model_degree_20 <- lm(y ~ poly(x, 20))

For more complex datasets, splines are a nice method to use:

library(splines)
my_model_spline <- lm(y ~ bs(x))

Here, bs() generates a B-spline basis (cubic by default). Use the knots and df parameters to make the fitted curve smoother or more flexible.
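For instance, a sketch using synthetic data as a stand-in for the table above (the df and knots values are just illustrations):

```r
library(splines)

# Stand-in data: a noisy quadratic, similar in shape to the example dataset
set.seed(1)
x <- seq(1, 30)
y <- (x - 0.1)^2 + rnorm(length(x), sd = 5)

# Cubic B-spline basis with 5 degrees of freedom: higher df = more flexible curve
spline_df5 <- lm(y ~ bs(x, df = 5))

# Alternatively, place interior knots explicitly
spline_knots <- lm(y ~ bs(x, knots = c(10, 20)))
```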



We were lucky in our example with the second-degree polynomial, but the idea here is to play around with these functions and their parameters, trying to find the best model possible.
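One way to compare candidate models numerically is AIC, which rewards fit but penalizes extra parameters. A sketch, again with synthetic stand-in data:

```r
# Stand-in data: a noisy quadratic
set.seed(1)
x <- seq(1, 30)
y <- (x - 0.1)^2 + rnorm(length(x), sd = 5)

m_linear    <- lm(y ~ poly(x, 1))
m_quadratic <- lm(y ~ poly(x, 2))

# Lower AIC is better; the quadratic should win easily here
AIC(m_linear, m_quadratic)

# summary() also reports R-squared and coefficient significance
summary(m_quadratic)$r.squared
```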
 

Check Results


After you have created some models, you can visually check how well they fit your data by plotting your x values against the fitted values. Here we use the lines() function to do that (note that lines() connects points in the order given, so sort by x first if your data is unordered):

lines(x, predict(lm(y ~ poly(x, 2))))







To further check how well your model fits your data, you can plot the model itself:

plot(my_model)


This will give you a set of diagnostic plots, such as residuals against fitted values. The links at the end of this post cover these diagnostics in more detail.

You can also use t.test() to compare the two groups (real versus modeled values). This test compares their means, assuming both samples are roughly normally distributed. Keep in mind that for a least-squares fit with an intercept, the mean of the fitted values equals the mean of the observed values by construction, so the residual plots above are usually more informative.

t.test(y, predict(my_model))

Read more about t.test() in the R help page (?t.test).
 

Predict New Data



Now, suppose you were able to find a good function to model your data. With that, we are able to predict future values for our small dataset.

One important thing about the predict() function in R is that it expects a data frame with the same column names and types as the one you used to fit your model.

For example:

my_prediction <- predict(my_model, data.frame(column_name = c(value_to_be_predicted)))

If you had used dates in numeric form, for example, you would have:

my_date <- "2016-05-10"
date_df <- data.frame(x = as.numeric(as.Date(my_date)))
my_pred <- predict(my_model, date_df)

In our example we used generic numbers with the column name "x":

my_pred <- predict(my_model, data.frame(x = c(31.1)))
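Putting it together, a self-contained sketch with stand-in data (with a dataset shaped like ours, the prediction should land near 31 squared):

```r
# Stand-in data and model
set.seed(1)
x <- seq(1, 30)
y <- (x - 0.1)^2 + rnorm(length(x), sd = 5)
my_model <- lm(y ~ poly(x, 2))

# The newdata column name ("x") must match the variable used in the formula
my_pred <- predict(my_model, data.frame(x = c(31.1)))
```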


Links


http://www.dummies.com/how-to/content/how-to-predict-new-data-values-with-r.html
http://www.r-bloggers.com/splines-opening-the-black-box/
http://statweb.stanford.edu/~jtaylo/courses/stats203/R/inference+polynomial/spline.R.html
http://data.princeton.edu/R/linearModels.html
http://www.r-bloggers.com/first-steps-with-non-linear-regression-in-r/



OBS: If you only want to interpolate your data, i.e., fit a smooth curve through your data points, check out the smooth.spline() function. It will fit your data directly, and you don't have to keep guessing the relationship between the variables.
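A minimal smooth.spline() sketch (stand-in data again; smooth.spline() chooses its own smoothness by cross-validation, so there is no formula to guess):

```r
# Stand-in data: a noisy quadratic
set.seed(1)
x <- seq(1, 30)
y <- (x - 0.1)^2 + rnorm(length(x), sd = 5)

fit <- smooth.spline(x, y)

# Evaluate the fitted curve at a point between observations
mid <- predict(fit, x = 15.5)  # returns a list with components $x and $y
```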


OBS2: If the function behind your data is complex, if you are not able to model your dataset correctly, or if you just want to try new stuff, neural networks can be a very powerful way of learning your data.