Parallel processing with map()
In this post we build on the previous post yet again, this time to understand parallel processing a bit better. Remember the poll from the last post, where we calculated confidence intervals (CIs)?
Where the results looked like this?
|Party|# of votes|Share|
Data from here
Good. In it, we calculated the CIs using purrr and a function we’d written that used infer to bootstrap and calculate the CIs. That was good, and we managed to cut down quite a bit on the amount of code we wrote.
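In case you don’t have the last post at hand, here is a rough stand-in you can play with. It is only a sketch: the voter data is made up, the function does its percentile bootstrap in base R rather than with infer, and the column names share, lower_ci and upper_ci are my own choices here, so the numbers will not match the poll. It also assumes dplyr is installed, since map_dfr() uses it to bind the rows.

```r
library(purrr)

# Hypothetical poll data: one row per respondent (NOT the data from the post)
set.seed(123)
voter <- data.frame(
  party = sample(c("A", "B", "C"), size = 1000, replace = TRUE,
                 prob = c(0.5, 0.3, 0.2))
)

# Stand-in for ci_calculation(): percentile bootstrap CI for one party's share
ci_calculation <- function(party, data, reps = 1000, level = 0.95) {
  hit <- data$party == party                 # TRUE where the vote went to `party`
  # Resample the votes with replacement and recompute the share each time
  boot_shares <- replicate(reps, mean(sample(hit, replace = TRUE)))
  alpha <- 1 - level
  ci <- quantile(boot_shares, probs = c(alpha / 2, 1 - alpha / 2))
  data.frame(party = party, share = mean(hit),
             lower_ci = ci[[1]], upper_ci = ci[[2]])
}

# One call per party, with the one-row results bound into a single data frame
complete_CI <- map_dfr(unique(voter$party), ci_calculation, voter)
print(complete_CI)
```

The first argument of the function is the party and the second is the data, which is why voter can simply be passed along as an extra argument to map_dfr().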
If you try it yourself, however, you’ll realise that the calculation takes a while. Not an absurd amount of time, especially compared to calculations that really take time, but it works as a good intro to parallel processing. Let’s take a look at how long the process takes, using the package tictoc, which measures the elapsed (wall-clock) time it takes to go from tic() to toc().
set.seed(123)
library(tictoc)
tic()
complete_CI <- map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 10.95 sec elapsed
Not too slow, but imagine you had a larger dataset, with 10 000 votes say, or wanted to run far more simulations: then the time adds up. Instead, we can use a variation of the map_*() functions, future_map_*() from the furrr package, that takes each calculation and runs it in parallel instead of in sequence. On a job this small, plan(multiprocess) takes some extra time to set up and eats into the gain, but in a larger setting you’ll save time. (In recent versions of the future package, multiprocess is deprecated and plan(multisession) is the replacement.)
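One caveat before the real run, shown on a toy example. As far as I understand it, set.seed() in your main session does not control the random numbers drawn on the parallel workers, so a bootstrap run through future_map_*() is not automatically reproducible. furrr has a seed option for this. The sketch below assumes a reasonably recent furrr (one that has furrr_options()); draw_mean() is a made-up function standing in for the bootstrap, and the two workers are an arbitrary choice.

```r
library(future)
library(furrr)

# Start two background R sessions to do the work
plan(multisession, workers = 2)

# Toy function that draws random numbers, standing in for a bootstrap
draw_mean <- function(i) mean(rnorm(100))

# Same seed passed through furrr_options() in both runs
run_1 <- future_map_dbl(1:4, draw_mean, .options = furrr_options(seed = 123))
run_2 <- future_map_dbl(1:4, draw_mean, .options = furrr_options(seed = 123))

# Compare the two runs
print(identical(run_1, run_2))

# Go back to ordinary sequential processing and release the workers
plan(sequential)
```

With the seed set this way, the two runs should come out the same regardless of how the work is split across the workers.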
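library(furrr)  # also attaches future, which provides plan()
set.seed(123)
plan(multiprocess)  # plan(multisession) in recent versions of future
tic()
complete_CI_paralell <- future_map_dfr(unique(voter$party), ci_calculation, voter)
toc()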
## 4.693 sec elapsed
There we go! That’s less than half the time of the sequential run! Just for fun, let’s look at how much time we actually saved, by putting plan(multiprocess) between tic() and toc().
set.seed(123)
tic()
plan(multiprocess)
complete_CI_paralell <- future_map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 8.596 sec elapsed
So only a bit over two seconds saved here once the set-up is counted, but still, better! And more importantly, we’ve learned something.
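If you want to see where the time goes, you can time the set-up and the actual work separately. The following is a minimal sketch on a toy workload, not the poll calculation: slow_task() is a made-up function that just sleeps for half a second, and the two workers are an arbitrary choice. It uses plan(multisession), the replacement for multiprocess in recent versions of future.

```r
library(future)
library(furrr)
library(tictoc)

# Toy workload standing in for the bootstrap: each call just sleeps briefly
slow_task <- function(i) {
  Sys.sleep(0.5)
  i
}

# Time the set-up of the parallel backend on its own
tic("setup")
plan(multisession, workers = 2)
setup_time <- toc()

# Time the parallel map on its own
tic("parallel map")
res <- future_map_int(1:4, slow_task)
map_time <- toc()

# Go back to sequential processing and release the workers
plan(sequential)

# toc() invisibly returns the start and stop times, so the elapsed
# seconds can also be pulled out as numbers
setup_seconds <- as.numeric(setup_time$toc - setup_time$tic)
map_seconds <- as.numeric(map_time$toc - map_time$tic)
print(c(setup = setup_seconds, map = map_seconds))
```

On a small job the set-up can be a large share of the total, which is the effect we saw above. The larger the work per task, the less that fixed cost matters.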