Working with Big Datasets in R

When dealing with a significant amount of data in R there are some points to consider. How do I know if my data is too big? Well, the term "big data" can be thought of as data that is too big to fit in the available memory. Since R works with the entire dataset in memory (unless you explicitly tell it not to), the first thing to do is to check how large the dataset in question is and whether it fits in memory. Remember that you should actually have at least double the memory of the size of your dataset. So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory. If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately. You can use the split command to do this in Linux:

    split -l 10000 file.txt new_file

This should create several new files (new_fileaa, new_fileab, etc.), each with ten thousand lines. Once you know your data will fit into memory, you can read it with the usual functions such as read.table() or read.csv().
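To make this concrete, here is a minimal R sketch of the chunked approach, assuming a comma-separated file named file.txt with no header (the file name and chunk size are just placeholders):

    # Check how large the file is on disk before trying to load it
    size_gb <- file.info("file.txt")$size / 1024^3
    cat("File size on disk:", round(size_gb, 2), "GB\n")

    # If the whole file fits comfortably in RAM, read it in one go
    # and check how much memory the resulting data frame actually uses:
    # df <- read.csv("file.txt", header = FALSE)
    # print(object.size(df), units = "Mb")

    # Otherwise, process the file chunk by chunk through a connection
    con <- file("file.txt", open = "r")
    chunk_size <- 10000                      # lines per chunk, as in the split example
    repeat {
      lines <- readLines(con, n = chunk_size)
      if (length(lines) == 0) break          # no more data left to read
      chunk <- read.csv(text = lines, header = FALSE)
      # ... work with 'chunk' here, e.g. accumulate sums or counts ...
    }
    close(con)

Processing the data this way keeps only one chunk in memory at a time, at the cost of having to combine the per-chunk results yourself afterwards.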