Problem: How do I generate a small random sample of a large CSV file to read into R?
Solution: subsample
One way to deal with large datasets is to read the data in smaller chunks and then combine the pieces. That isn’t trivial, and it may not be worthwhile, especially if you just want to poke around.
What about just looking at a single one of those smaller chunks?
That might be a better approach, particularly if this is purely exploratory. But the problem is how to read in a representative sample. Any of the “read” functions (e.g. read.csv(), read.delim(), etc.) let you cap the number of rows they read via the nrows argument. But the data could be sorted in a non-random way, so the first n rows may be biased. So what about generating a small random sample outside of R and then reading that in?
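For example, here is what a capped read looks like (a small sketch, reusing the purple.csv file from later in this post):

head_only <- read.csv("purple.csv", nrows = 1000)  # reads only the first 1,000 rows

If the file happens to be sorted by date, region, or anything else, those first 1,000 rows are anything but a random sample.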
I’ve found several solutions to this problem on Stack Overflow, most of which involve some Perl or Bash scripting … but there’s an easier way.
subsample is a command-line tool built with Python. As long as you have pip installed, you can use the following:
pip install subsample
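To confirm the install worked, the tool should respond to a help flag (I’m assuming the standard --help here; adjust if the CLI differs):

subsample --help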
The workflow is simple:
- Identify the original CSV file to sample
- Decide how many rows you want
- Redirect the output to a new file
Implemented in code:
subsample -n 1000 purple.csv > purple_sample.csv
Then to get that into R:
rain <- read.csv("purple_sample.csv")
The documentation covers details like handling header rows and setting a random seed so a sample can be reproduced.
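For instance, reproducible sampling looks roughly like this (the --seed flag name is my assumption; check subsample --help or the project README for the exact option):

subsample --seed 42 -n 1000 purple.csv > purple_sample.csv

Running that command twice with the same seed should give you the identical 1,000-row sample.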