Author: Omar Trejo Navarro

You know how people say “A picture is worth a thousand words”? Well, an animation can be worth even more! When trying to understand dynamics or illustrate concepts, animations can be very powerful. Lucky for us, R can be easily used to create them. Yes, that’s right! We don’t have to use any external software to create nice animations; we can simply use the tools we already know and love. Get ready to impress your friends and colleagues using them!

In this post I’ll show you two different ways to produce animations, with the animation and gganimate packages. As we will see, they require different ways of thinking about animations, which makes each of them a natural fit for different scenarios, but you can do most practical animations with either of them. If you want to dig deeper into the subject, I’d recommend looking into the great plotly package as well.

If you find this post interesting, you may want to look into my recent book “R Programming by Example” published by Packt. If you do, please publish your opinion about it in the book’s page. I really appreciate it, and every comment can go a long way. Also, if you have any questions or feedback, please don’t hesitate to contact me through my website.

## What will we create?

A common statistics programming exercise is to find whether the confidence intervals associated with different random samples contain the real value of a parameter used to generate them. In this post we will take on the most common case: random samples from a normal distribution with known variance and unknown mean. This unknown mean is a value we actually know but will pretend not to, so that we can find out whether a confidence interval contains it.

As you may know, when creating confidence intervals you must specify a confidence level, which determines the quantile of the distribution used to build the intervals and, in turn, their width. This is intuitive: the more confident you want to be that a value is within an interval, the wider that interval must be. In this case we will use 95% confidence intervals, meaning that for every 100 confidence intervals we generate, on average 95 of them should contain the “real” mean value used to generate the random samples, and 5 should not.
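To make the 95% claim concrete, here is a minimal sketch that estimates the coverage by simulation. The variable names (true_mean, sd_known, n_obs, n_sims) and the seed are illustrative choices, not part of the code used later in this post.

```r
# Sketch: estimate how often a 95% interval contains the true mean.
set.seed(2024)               # arbitrary seed, only so the run is repeatable
true_mean <- 0
sd_known  <- 1
n_obs     <- 100
n_sims    <- 10000
z         <- qnorm(0.975)    # two-sided 95% critical value, about 1.96

covers <- replicate(n_sims, {
  m <- mean(rnorm(n_obs, true_mean, sd_known))
  half_width <- z * sd_known / sqrt(n_obs)
  # TRUE when the interval [m - half_width, m + half_width] contains the mean
  (m - half_width <= true_mean) && (true_mean <= m + half_width)
})

coverage <- mean(covers)     # proportion of intervals that contain the mean
print(coverage)
```

The printed proportion should land close to 0.95, which is exactly what the confidence level promises in the long run.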

To make sure the code for the animations is clear, we will separate the generation of the random samples and their corresponding confidence intervals from the animations. In the animation, if the confidence interval contains the “real” mean, it will be gray. If it doesn’t, it will be red to make it stand out. You can see below an image of the animation once everything is being shown.

Note that the animation will be different when you run the code yourself because it depends on the parameters used as well as the samples generated, which are not being controlled in this example and will surely be different for you. If you want to get the same animations every time you run the code for an example such as this one, make sure you use set.seed(12345) (with an integer of your preference) before executing the code.
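As a quick illustration of what seeding does, the sketch below draws the same sample twice after resetting the seed, and then a third time without resetting it. The object names are only for this example.

```r
# Resetting the seed reproduces exactly the same random draws.
set.seed(12345)
first_draw <- rnorm(5)

set.seed(12345)
second_draw <- rnorm(5)

# Without resetting the seed, the generator simply keeps going.
third_draw <- rnorm(5)

print(identical(first_draw, second_draw))  # same seed, same numbers
print(identical(first_draw, third_draw))   # different position in the stream
```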

## Generating our confidence intervals

We will treat four variables as globals. SD, MEAN, N_S, and N_CI are global parameters for the standard deviation, mean, number of observations in each random sample, and number of confidence intervals (one per random sample), respectively.

    SD <- 1
    MEAN <- 0
    N_S <- 100
    N_CI <- 50

We define the confidence_interval() function to use the globals mentioned before to produce a normal random sample, using the rnorm() function, with N_S observations, MEAN mean, and SD standard deviation. Our confidence intervals will be tested against the condition that they contain the MEAN value. Once we get the mean for our current random sample as m, we can compute the boundaries for its confidence interval by adding and subtracting the critical value (1.96 for 95% confidence) times the standard error, SD / sqrt(N_S), which reduces to 1.96 / sqrt(N_S) here because SD is 1.

If we did not assume we knew the standard deviation, we would have to resort to another formula which makes use of Student’s t distribution instead of the normal distribution. Also, if we wanted 90% confidence intervals, we would use a critical value of 1.645 instead, but you probably know all of that, so I’ll just focus on the code.
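If you would rather not memorize critical values, base R can compute them. The following is a small sketch using qnorm() and qt(); the variable names and the sample size of 100 are only for illustration.

```r
# Two-sided critical values from the standard normal distribution.
z_95 <- qnorm(0.975)   # 95% confidence: 2.5% in each tail
z_90 <- qnorm(0.95)    # 90% confidence: 5% in each tail

# With an unknown standard deviation we would use Student's t instead,
# with n - 1 degrees of freedom (here n = 100).
t_95 <- qt(0.975, df = 100 - 1)

print(round(c(z_95 = z_95, z_90 = z_90, t_95 = t_95), 3))
```

The t critical value is slightly larger than its normal counterpart, which makes the resulting intervals a bit wider to account for the extra uncertainty in the estimated standard deviation.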

Some of you may not know why we need the three dots (...) in the function signature. Those three dots are called an ellipsis, and they let the function accept an arbitrary number of arguments. We need them because we intend to use our function within an lapply() call, which passes the current element of the iteration as an argument by default; if confidence_interval() could not receive it, we would get an error. We could just as easily have used a named parameter, but I think the ellipsis is more elegant than an explicit parameter that goes unused and may confuse future readers and users of our code.

    confidence_interval <- function(...) {
        m <- mean(rnorm(N_S, MEAN, SD))
        # Critical value times the standard error, SD / sqrt(N_S)
        low <- m - 1.96 * SD / sqrt(N_S)
        high <- m + 1.96 * SD / sqrt(N_S)
        return(c(low, high))
    }
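To see why the ellipsis matters, here is a small sketch comparing two toy functions (draw_with_dots and draw_without_dots are hypothetical names made up for this demonstration). The first accepts and ignores whatever lapply() passes in; the second has no parameters, so the call fails.

```r
# Accepts (and ignores) any arguments, like confidence_interval() does.
draw_with_dots <- function(...) mean(rnorm(10))

# Accepts no arguments at all.
draw_without_dots <- function() mean(rnorm(10))

# Works: the current element (1, 2, 3) is swallowed by the ellipsis.
ok <- lapply(1:3, draw_with_dots)
print(length(ok))

# Fails: lapply() passes the current element, which the function cannot take.
# tryCatch() turns the error into its message so the script keeps running.
result <- tryCatch(
  lapply(1:3, draw_without_dots),
  error = function(e) conditionMessage(e)
)
print(result)
```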

The CIS object will contain a list of vectors, where each vector contains the confidence interval boundaries. We create it by simply running the confidence_interval() function N_CI times. Since our code is very simple and makes use of globals, we don’t need to pass it any parameters. Keep in mind that there is a parameter actually being passed to the confidence_interval() function, namely the current value of the 1:N_CI vector, but it will be caught by the ellipsis and will not be used.

CIS <- lapply(1:N_CI, confidence_interval)

To take a look at our confidence intervals, we can simply print the CIS object. In this case we see that the first confidence interval goes from about -0.12 to 0.25, the second from about -0.23 to 0.17, and so on. We are going to use these values to generate our animations in the following sections. As you can see, these values correspond to the first three confidence intervals in the image shown above, and all of them contain the MEAN used, which was 0, so they appear in gray.

    > CIS
    [[1]]
    [1] -0.1217169 0.2502831
    [[2]]
    [1] -0.2287392 0.17326079
    [[3]]
    [1] -0.2311497 0.3208503
    (...)

In the following sections we will see how to generate the actual animations that will show each confidence interval one by one.

## Using the animation package

The first animation we generate will be with the animation package, which was developed by Yihui Xie, a Statistics PhD from Iowa State University. It works by capturing different frames of a graph. Before we actually show the code for our animation, here’s a simple example that will create 10 frames using a for loop, where each iteration will plot a uniform random sample. As you can see, to generate animations with this package we create each frame separately, and the saveGIF() function, which receives an expression with our code inside, takes those images and stitches them together into an “animation”.

Given this information you may realize that if you want a “smoother” animation, you will probably want to create more frames and reduce the amount of movement among them, which would be correct. Unfortunately, you may run into problems when increasing the number of frames, as I have. You’ll have to explore what fits your use case.

saveGIF({ for (i in 1:10) plot(runif(10), ylim = 0:1) })

Now that you understand the basic mechanics of how animations are generated with this package, we will look into the code for our confidence intervals animation. The first thing to note is that we import the file we created in the previous section ( animations-with-R-base.R ) to make sure we have the CIS , MEAN , and N_CI objects available in memory. Next we load the animation package, and specify that we want .25 seconds between each frame in the animation, which is equivalent to 4 FPS (frames per second). Since we will create an animation with 50 confidence intervals, the total time for our animation will then be 12.5 seconds.

We send our expression to the saveGIF() function, in which we use a for loop as we did before, and for each confidence interval we create an empty plot (note the 0 in the first parameter), with N_CI values on the x-axis, and a range for the y-axis that allows us to properly see the min/max values of our confidence intervals ([-0.6, 0.6] in this case).

We also provide some labels, as well as a horizontal grey line at the MEAN level which is 0 .

Next we extract the necessary values for each frame. These values will be composed of all confidence interval values up to i. Since i can be thought of as the frame we’re currently working with, this means that we get the confidence intervals from the first up to the one for the current frame. This allows us to produce an effect where intervals are added one by one to the animation, but in reality we are getting the first i elements each time, not just adding the latest one. The actual extraction is done with a combination of the lapply() and unlist() functions, as well as an anonymous function that simply returns the corresponding value from the pair for each confidence interval. This is done so that we have the lower and upper boundary values of the confidence intervals in two separate vectors, which is required by the segments() function to actually draw the lines in the graph.

The only thing missing is the color. In this case we apply a similar extraction mechanism to the one described in the previous paragraph, but instead of simply returning a value from a vector, we check whether both boundaries are larger, or both smaller, than the MEAN, in which case the confidence interval does not contain the MEAN value, and thus “red” is returned. Otherwise, “gray” is returned. After unlist(), the color object ends up being a character vector that has one element for each element in the x object (all other objects there also contain the same number of values). Finally we specify the line width to be a bit thicker with lwd = 3.

    # Load CIS, MEAN, and N_CI from the previous section's file
    source("animations-with-R-base.R")
    library(animation)
    ani.options(interval = .25)
    saveGIF({
        for (i in 1:N_CI) {
            # Empty canvas with fixed axes for every frame
            plot(
                0,
                xlim = c(1, N_CI),
                ylim = c(-0.6, 0.6),
                xlab = "Sample",
                ylab = "Confidence Interval",
                main = "Confidence Interval Animation"
            )
            abline(h = MEAN, col = "gray", lwd = 2)
            # Boundaries of the first i confidence intervals
            x <- 1:i
            y1 <- unlist(lapply(CIS[1:i], function(ci) ci[[1]]))
            y2 <- unlist(lapply(CIS[1:i], function(ci) ci[[2]]))
            # Red when the interval misses MEAN, gray otherwise
            color <- unlist(lapply(CIS[1:i], function(ci) {
                if (all(ci > MEAN) || all(ci < MEAN)) {
                    return("red")
                }
                return("gray")
            }))
            segments(x, y1, x, y2, col = color, lwd = 3)
        }
    })

When we execute the previous code, we can see our animation, which is shown below. Pretty cool, isn’t it? You may want to look into the package’s examples to find more interesting use cases. You should also know that you can export to HTML, SWF, and even LaTeX files, if you so desire. Personally, I’ve found GIF to be portable enough to be used wherever I need it.

## Using the gganimate package

Now we are going to create a similar animation, but this time with the gganimate package, which was developed by David Robinson, Chief Data Scientist at DataCamp, who also holds a PhD in Computational Biology from Princeton and is a book author. This package was designed to work with ggplot2 and to use data frames as input. It extends the aes() function (aesthetics), used by the ggplot() function, to accept frame and cumulative parameters. Those parameters indicate which observations should be used in which frames, and whether the process should be cumulative. If this last parameter is set to TRUE, we don’t need to manually accumulate objects in the graph as we did with the animation package.

As you can see, we need to import both the ggplot2 and gganimate packages. Then we create our data frame with a structure similar to the one we had in separate vectors in the previous example. Naturally we have one observation for each frame in our animation. Sometimes this will not come naturally and you’ll need more data preparation, but this case is simple in that regard.

Next we create a graph as we normally would with the ggplot() function, but we specify two new parameters in the aes() function: frame and cumulative. The frame parameter lets gganimate know which variable indicates the data that should be used for each frame, and cumulative produces the behavior mentioned above. Everything else is standard ggplot2 code. As we did before, we add a horizontal segment, as well as a segment for each confidence interval, we specify the colors, and we remove the legend because it’s not useful in this case.

    library(ggplot2)
    library(gganimate)
    data <- data.frame(
        x1 = 1:N_CI,
        x2 = 1:N_CI,
        y1 = unlist(lapply(CIS, function(ci) ci[[1]])),
        y2 = unlist(lapply(CIS, function(ci) ci[[2]])),
        color = unlist(lapply(CIS, function(ci) {
            if (all(ci > MEAN) || all(ci < MEAN)) {
                return("red")
            }
            return("gray")
        }))
    )
    plot <- ggplot(data, aes(color = color, frame = x1, cumulative = TRUE)) +
        geom_segment(
            aes_string(x = 0, xend = 50, y = 0, yend = 0),
            color = "gray"
        ) +
        geom_segment(
            aes(x = x1, xend = x2, y = y1, yend = y2),
            size = 1.5
        ) +
        scale_color_manual(values = c("gray", "red")) +
        theme(legend.position = "none")
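As a side note, the repeated lapply()/unlist() calls used to build the data frame can be condensed. The sketch below is one possible alternative, not the approach used in this post: it binds the intervals into a matrix with do.call(rbind, ...) and derives the colors with a vectorized ifelse(). It regenerates its own intervals with an arbitrary seed so that it runs on its own, and the names bounds and ci_data are illustrative.

```r
# Self-contained setup mirroring the globals used in this post.
set.seed(1)                         # arbitrary seed for repeatability
MEAN <- 0; SD <- 1; N_S <- 100; N_CI <- 50
CIS <- lapply(1:N_CI, function(...) {
  m <- mean(rnorm(N_S, MEAN, SD))
  half_width <- 1.96 * SD / sqrt(N_S)
  c(m - half_width, m + half_width)
})

# Stack the (low, high) pairs into an N_CI x 2 matrix.
bounds <- do.call(rbind, CIS)

ci_data <- data.frame(
  x1 = seq_len(N_CI),
  x2 = seq_len(N_CI),
  y1 = bounds[, 1],
  y2 = bounds[, 2],
  # An interval misses MEAN when it lies entirely above or entirely below it.
  color = ifelse(bounds[, 1] > MEAN | bounds[, 2] < MEAN, "red", "gray")
)

print(head(ci_data, 3))
```

The resulting data frame has the same columns as the one built above, so it could be passed to ggplot() in the same way.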

If you want to see the final frame of the animation (where all the accumulation has been done), you can do so by simply printing the object (as you would with any ggplot2 graph). This is very helpful because you can check that the final result is what you expect without producing the animation.

print(plot)

To actually see the animation, you need to pass the plot object to the gganimate() function which will generate it and show it to you. The result is shown below.

gganimate(plot)
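One caveat: the frame aesthetic and the gganimate() function belong to the original gganimate API. Releases from 1.0 onward replaced them with transition_*() functions and animate(). The following is a rough sketch of how a similar cumulative animation might look with the newer API, assuming a recent gganimate release in which transition_manual() accepts cumulative = TRUE. The data preparation, seed, and object names are illustrative, and the plotting part is skipped if the packages are not installed.

```r
# Illustrative data: 50 intervals from samples of size 100 with SD = 1.
set.seed(1)                          # arbitrary seed for repeatability
n_ci  <- 50
n_obs <- 100
means <- replicate(n_ci, mean(rnorm(n_obs)))
half_width <- 1.96 / sqrt(n_obs)     # SD is 1, so it drops out

ci_data <- data.frame(
  x    = seq_len(n_ci),
  low  = means - half_width,
  high = means + half_width
)
ci_data$color <- ifelse(ci_data$low > 0 | ci_data$high < 0, "red", "gray")

# Only build the animation object when both packages are available.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gganimate", quietly = TRUE)) {
  library(ggplot2)
  library(gganimate)

  anim <- ggplot(ci_data) +
    geom_hline(yintercept = 0, color = "gray") +
    geom_segment(aes(x = x, xend = x, y = low, yend = high, color = color)) +
    scale_color_identity() +         # use the color names stored in the data
    theme(legend.position = "none") +
    transition_manual(x, cumulative = TRUE)  # one frame per interval, kept on screen

  # Rendering is left commented out because it needs a GIF renderer installed:
  # animate(anim, fps = 4)
}
```

Here transition_manual() plays the role of the frame and cumulative parameters, and animate() takes the place of gganimate().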

## Summary

In this post we have seen two different ways to create animations within R. They operate in slightly different ways, and each may end up being more intuitive to use in different situations. I prefer working with gganimate rather than with animation because it uses tools that I normally find very intuitive (ggplot2 and data frames), but also because it doesn’t require an explicit for loop to create each frame: you can simply prepare the data and create a normal ggplot() graph, specifying which observations should be used in each frame. I find this approach very intuitive, but I know people who find the explicit looping approach more intuitive. Which one do you find more interesting? Let me know in the comments!

Now that you have the fundamentals for animations with R, imagine all the possibilities. All those times that your colleagues were confused about what you were thinking, and you said to yourself “If only I could show them this with an animation, it

### R Programming by Example

R is a high-level statistical language and is widely used among statisticians and data miners to develop analytical applications. Given the obvious advantages that R brings to the table, it has become imperative for data scientists to learn the essentials of the language.

R Programming by Example is a hands-on guide that helps you develop a strong fundamental base in R by taking you through a series of illustrations and examples. Written by Omar Trejo Navarro, the book starts with the basic concepts and gradually progresses towards more advanced concepts to give you a holistic view of R. Omar is a well-respected data consultant with expertise in applied mathematics and economics.

If you are an aspiring data science professional or statistician and want to learn more about R programming in a practical and engaging manner, R Programming by Example is the book for you!