If you want to know how much a variable "x" influences a variable "y", you might want to run a regression on your data. If you have a series of data points over time and you want to know what your data is going to look like in the future, regression can help with that too.
I will try to describe the steps that helped me successfully build linear and nonlinear regressions in R, using polynomials and splines. I am not going to go into too much detail on each method. I just want to give an overall step-by-step on how to do a general regression with R, so that you guys can go further on your own.
First Steps: get to know your data
The first thing you should do is see what your data looks like. Plot the data, maybe try to get some statistics out of it, and try to understand what type of relation there is between variables.
There might be a linear (straight line) or nonlinear (curvy) relation between your data points. The data in question might depend on only one variable or on several variables.
Suppose you have the following dummy dataset:
x       y
1.1     1
2.1     4
3.1     10
4.1     16
5.1     25
6.1     36
7.6     52
8.1     64
9.1     81
10.1    100
11.1    121
13.1    138
13.1    169
14.1    196
15.1    225
16.1    256
17.1    289
18.1    324
19.1    361
18.1    400
21.1    441
22.1    484
23.1    529
24.6    574
25.1    625
26.1    676
27.1    729
29      789
29.1    841
30.1    900
Here we have one variable y that varies according to another variable x.
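As a starting point, the dataset can be typed into R as two plain vectors and summarised. This is only a minimal sketch: the data frame name my_data is a name picked for illustration and is not used elsewhere in this post.

```r
# The dummy dataset from the table, as two numeric vectors
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# my_data is a hypothetical name used only in this sketch
my_data <- data.frame(x = x, y = y)

# Basic statistics (min, quartiles, mean, max) for each column
summary(my_data)

# Pearson correlation only measures the linear part of the relation,
# so a high value here does not rule out a curved pattern
cor(x, y)
```

A plot is still the quickest way to spot curvature, since the correlation coefficient alone can look high even when the relation is not a straight line.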
Plotting this data would get us the following graph:
> plot(x, y)
> grid(nx = 12, ny = 12, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)
Visually, we can say that this data seems to follow a nonlinear pattern. Even further, the relation between y and x seems to be a second degree polynomial.
Models
The natural first thing to do is to check whether a second degree polynomial matches our data. To create a model, we can use the lm() function. You pass it a formula describing the equation you think might suit your data. Here we are actually "guessing" which model best fits the data.
my_model <- lm(y ~ poly(x, 2))
In the example above we are creating a model where y is a second degree polynomial of x, that is, y = b0 + b1*x + b2*x^2.
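One detail that can be confusing: by default poly() builds orthogonal polynomial terms, so the coefficients printed by lm() are not b0, b1 and b2 directly. The sketch below, which assumes the dummy dataset from this post and uses the illustrative name my_model_raw, shows that raw = TRUE gives the plain x and x^2 terms while describing the same curve.

```r
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# Default: orthogonal polynomial terms (numerically more stable)
my_model <- lm(y ~ poly(x, 2))

# raw = TRUE: plain x and x^2 terms (my_model_raw is a hypothetical name)
my_model_raw <- lm(y ~ poly(x, 2, raw = TRUE))

# Both parameterisations produce the same fitted curve
all.equal(fitted(my_model), fitted(my_model_raw))

# With raw = TRUE the coefficients are b0, b1 and b2
coef(my_model_raw)
```

The orthogonal form is usually the safer default for higher degrees; the raw form is mainly useful when you want to read the equation off the coefficients.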
Of course, you might want to test other alternatives:
my_model_linear <- lm(y ~ poly(x, 1))
Or (why not?)
my_model_degree_20 <- lm(y ~ poly(x, 20))
For more complex datasets, splines are a nice method to use:
library(splines)
my_model_spline <- lm(y ~ bs(x))
Here, bs() generates the B-spline basis for x. Use the parameters knots and df to make the fitted curve smoother or curvier.
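To make the effect of df and knots a bit more concrete, here is a small sketch on the dummy dataset. The model names and the knot positions 10 and 20 are arbitrary choices for illustration, not recommendations.

```r
library(splines)

x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# df = 5 with the default cubic basis lets bs() pick 2 interior knots
# at quantiles of x
my_model_spline_df <- lm(y ~ bs(x, df = 5))

# Alternatively, place the interior knots by hand (10 and 20 are arbitrary)
my_model_spline_knots <- lm(y ~ bs(x, knots = c(10, 20)))

# Number of basis columns produced by each call
ncol(bs(x, df = 5))
ncol(bs(x, knots = c(10, 20)))
```

More knots (or a larger df) give the curve more freedom to bend, which helps with complicated shapes but also makes overfitting easier.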
We were lucky in our example with the second degree polynomial, but the idea here is to mess around a little with these functions and parameters, trying to find the best model possible.
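While messing around with candidate models, it can help to compare them with a number as well as by eye. The sketch below is one possible way to do that, using AIC and adjusted R-squared on the dummy dataset; these two criteria are my choice for the illustration, not the only option.

```r
library(splines)

x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model_linear <- lm(y ~ poly(x, 1))
my_model        <- lm(y ~ poly(x, 2))
my_model_spline <- lm(y ~ bs(x))

# Lower AIC suggests a better trade-off between fit and complexity
AIC(my_model_linear, my_model, my_model_spline)

# Adjusted R-squared penalises extra terms, unlike plain R-squared
sapply(
  list(linear = my_model_linear, quadratic = my_model, spline = my_model_spline),
  function(m) summary(m)$adj.r.squared
)
```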
Check Results
After you have created some models, you can visually check how they fit your data by drawing the model values over your x values on the existing plot. Here we use the lines() function to do that:
lines(x, predict(lm(y ~ poly(x, 2))))
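One thing to watch for: lines() connects the points in the order they are given, and the x values in the dummy dataset are not strictly increasing (18.1 appears after 19.1), so the curve can double back on itself. A possible workaround, sketched below, is to predict on a sorted grid of x values. The name x_grid and the choice of 200 points are assumptions made for this example.

```r
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model <- lm(y ~ poly(x, 2))

# Scatter plot of the raw data
plot(x, y)

# x_grid (hypothetical name): 200 evenly spaced, sorted x values
x_grid <- seq(min(x), max(x), length.out = 200)

# Draw the fitted curve over the grid instead of the original x values
lines(x_grid, predict(my_model, data.frame(x = x_grid)), col = "red")
```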
To further check how well your model fits your data, you can plot the model itself:
plot(my_model)
This is going to give you a set of diagnostic plots, such as the residuals against the fitted values.
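The same information is also available as numbers. The following sketch, again on the dummy dataset, shows the four default diagnostic plots on a single page and then looks at the model summary and the largest residuals. Picking the top three residuals is just an illustrative choice.

```r
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model <- lm(y ~ poly(x, 2))

# Show the four default diagnostic plots in a 2 x 2 layout,
# then restore the previous graphical settings
old_par <- par(mfrow = c(2, 2))
plot(my_model)
par(old_par)

# Coefficients, residual standard error and R-squared
summary(my_model)

# The three observations the model misses by the most
head(sort(abs(residuals(my_model)), decreasing = TRUE), 3)
```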
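You can also use t.test() to see whether the two groups (real versus modeled values) are similar. This test compares their means, assuming both follow a normal distribution. Keep in mind that for a least-squares model with an intercept the fitted values always have the same mean as y, so this test cannot tell a good fit from a bad one on its own; the residuals are more informative.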
t.test(y, predict(my_model))
Read more about t.test() by typing ?t.test in the R console.
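A caveat worth checking for yourself: when lm() fits a model with an intercept, the fitted values have the same mean as y, so comparing means says little about the quality of the fit. The sketch below illustrates this on the dummy dataset by running the test on a deliberately poor straight-line model next to the quadratic one, and then comparing their residual standard errors instead.

```r
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model_linear <- lm(y ~ poly(x, 1))  # deliberately too simple
my_model        <- lm(y ~ poly(x, 2))

# The means of the fitted values match the mean of y for both models
c(observed  = mean(y),
  linear    = mean(fitted(my_model_linear)),
  quadratic = mean(fitted(my_model)))

# So the t-test p-value is close to 1 in both cases
t.test(y, predict(my_model_linear))$p.value
t.test(y, predict(my_model))$p.value

# The residual standard error does separate the two models
summary(my_model_linear)$sigma
summary(my_model)$sigma
```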
Predict New Data
Now, suppose you were able to find a good function to model your data. With that, we are able to predict future values for our small dataset.
One important thing about the predict() function in R is that it expects a data frame with the same column names and types as the data you used to fit your model.
For example:
my_prediction <- predict(my_model, data.frame(column_name = c(value_to_be_predicted)))
If you had used dates in numeric form, for example, you would have:
my_date <- "2016-05-10"
date_df <- data.frame(x = as.numeric(as.Date(my_date)))
my_pred <- predict(cubic_model, date_df)
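For completeness, here is a self-contained sketch of the date case. The daily series below is simulated purely so that the snippet runs (it is not real data), and the cubic trend, noise level and date range are all assumptions made for the illustration.

```r
# Simulated daily observations, for illustration only
set.seed(1)
dates <- seq(as.Date("2016-01-01"), as.Date("2016-04-30"), by = "day")

# as.numeric() on a Date gives the number of days since 1970-01-01
x <- as.numeric(dates)
y <- 0.001 * (x - min(x))^3 + rnorm(length(x), sd = 20)

# Third degree polynomial fitted on the numeric dates
cubic_model <- lm(y ~ poly(x, 3))

# The new data frame must use the same column name ("x") and the same
# numeric representation as the data used to fit the model
my_date <- "2016-05-10"
date_df <- data.frame(x = as.numeric(as.Date(my_date)))
my_pred <- predict(cubic_model, date_df)
my_pred
```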
In our example we used generic numbers with the name "x":
my_pred <- predict(my_model, data.frame(x = c(31.1)))
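Since 31.1 lies just outside the range of observed x values, this prediction is an extrapolation, and it can be useful to ask predict() for an interval as well as a point estimate. A minimal sketch on the dummy dataset follows; the names new_data and my_pred_interval and the 95% level are choices made for this example.

```r
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model <- lm(y ~ poly(x, 2))

# Same column name ("x") as in the model formula
new_data <- data.frame(x = c(31.1))

# Point prediction plus a 95% prediction interval (columns fit, lwr, upr)
my_pred_interval <- predict(my_model, new_data,
                            interval = "prediction", level = 0.95)
my_pred_interval
```

The further the new x is from the observed data, the wider this interval gets, which is a useful warning sign when extrapolating.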
Links
http://www.dummies.com/howto/content/howtopredictnewdatavalueswithr.html
http://www.rbloggers.com/splinesopeningtheblackbox/
http://statweb.stanford.edu/~jtaylo/courses/stats203/R/inference+polynomial/spline.R.html
http://data.princeton.edu/R/linearModels.html
http://www.rbloggers.com/firststepswithnonlinearregressioninr/
OBS: If you only want to interpolate your data, that is, draw a "line" through your data points, check the smooth.spline function. It will fit a smooth curve to your data, and you don't have to keep guessing the relation between the variables.
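A quick sketch of smooth.spline on the dummy dataset, with the smoothing level left at its default. The name my_smooth and the query point 15 are arbitrary choices for the illustration.

```r
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# Fit a smoothing spline; no formula or degree has to be guessed
my_smooth <- smooth.spline(x, y)

# Unlike predict() on an lm model, this takes a plain numeric vector
# and returns a list with components x and y
my_smooth_pred <- predict(my_smooth, 15)
my_smooth_pred$y

# Overlay the smoothed curve on the data
plot(x, y)
lines(my_smooth, col = "blue")
```

Note that with default settings the curve smooths the data and does not necessarily pass through every point.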
OBS2: If the function behind your data is complex, if you are not able to model your dataset correctly, or if you are just willing to try new stuff, Neural Networks can be a very powerful way of learning your data.