# Understanding Regression Trees

Learn about regression trees in this tutorial by Giuseppe Ciaburro, a Ph.D. in environmental technical physics with over 15 years of experience in programming with Python, R, and MATLAB, in the field of combustion, acoustics, and noise control.

Decision trees are used to predict a response (class y) from several input variables: x1, x2,…,xn. If y is a continuous response, it’s called a regression tree; if y is categorical, it’s called a classification tree. That’s why these methods are often called Classification and Regression Trees (CART). The algorithm checks the value of an input (xi) at every node of the tree and continues to the left or right branch based on the (binary) answer. When you reach a leaf, you will find the prediction.

The algorithm starts with all the data grouped into a single node (the root node) and, at every step, performs an exhaustive search over all possible subdivisions. At each step, the best subdivision (the one that produces branches as homogeneous as possible) is chosen.

In regression trees, you try to partition the data space into parts small enough that a simple (but different) model can be applied to each part. The non-leaf part of the tree is just the procedure for determining, for each data point x, which model will be used to predict it.

A regression tree is formed by a series of nodes that split the root branch into two child branches. Such subdivision continues to cascade. Each new branch can then lead to another node or remain a leaf with the predicted value.

Starting from the whole dataset (root), the algorithm creates the tree through the following procedure:

1. Identify the best predictor X1 for splitting the dataset and the best split value s1. The left-hand branch will be the set of observations where X1 is below s1, while the right-hand branch comprises the set of observations in which X1 is greater than or equal to s1.
2. This operation is then recursively executed again (independently) for every branch until there is no possibility of division.
3. When the divisions are completed, a leaf is created, which indicates the output values.
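The first of these steps can be sketched in a few lines of R. This is a toy illustration of my own (the function name and the `x <= s` convention are not from any package, and the real algorithm is far more efficient): an exhaustive search for the split value of a single predictor that minimizes the total within-branch sum of squares.

```r
# Toy sketch of step 1: try every candidate split point on one predictor
# and keep the one minimizing the combined within-branch sum of squares.
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  candidates <- candidates[-length(candidates)]  # both branches must be non-empty
  sse <- function(v) sum((v - mean(v))^2)
  scores <- sapply(candidates, function(s) {
    sse(y[x <= s]) + sse(y[x > s])   # left branch: x <= s; right branch: x > s
  })
  candidates[which.min(scores)]
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)
best_split(x, y)   # 3: the split falls between the two clusters of y values
```

Recursing this search on each resulting branch, and stopping when a branch can no longer be divided, gives the procedure described above.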

Suppose you have a response variable with only two continuous predictors (X1 and X2) and four split values (s1, s2, s3, s4). The following figure proposes a way to represent the whole dataset graphically.

The goal of a regression tree is to encapsulate the whole dataset in the smallest possible tree. To minimize the tree size, the simplest possible explanation for a set of observations is preferred over other explanations. All this is justified by the fact that small trees are much easier to comprehend than large trees.

You saw how the regression tree algorithm works. These steps can be summarized in the following processes:

• Splitting: The dataset is partitioned into subsets. The split operation is based on a set of rules, for example, minimizing the sums of squares within the resulting subsets. Each leaf node contains a small subset of the observations. Splitting continues until a leaf node is constructed.
• Pruning: In this process, the tree branches are shortened. The tree is reduced by transforming a few branch nodes into leaf nodes and removing the leaf nodes under the original branch. Care must be taken, as the lower branches can be strongly influenced by abnormal values; pruning reduces this problem. A simpler tree often avoids overfitting.
• Tree selection: Finally, the smallest tree that matches the data is selected. This process is executed by choosing the tree that produces the lowest cross-validated error.

To fit a regression tree in R, you can use the tree() function implemented in the tree package. In this package, a tree is grown via binary recursive partitioning, using the response in the specified formula and choosing splits from the terms of the right-hand side. Numeric variables are divided into X < a and X > a. The split that maximizes the reduction in impurity is chosen, the dataset is split, and the process is repeated. Splitting continues until the terminal nodes are too small or too few to be split. Take a look at the following table for basic information on this package:

| Package | tree |
| --- | --- |
| Date | January 21, 2016 |
| Version | 1.0-37 |
| Title | Classification and Regression Trees |
| Author | Brian Ripley |

To perform a regression tree example, begin with the data. Use the mtcars dataset contained in the datasets package. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a data frame with 32 observations on the following 11 variables:

• mpg: Miles per gallon
• cyl: Number of cylinders
• disp: Engine displacement (cubic inches)
• hp: Engine horsepower
• drat: Rear axle ratio
• wt: Weight (1000 lbs)
• qsec: 1/4 mile time
• vs: Engine (0 = V-shaped, 1 = straight)
• am: Transmission (0 = automatic, 1 = manual)
• gear: Number of forward gears
• carb: Number of carburetors

The fuel consumption of vehicles has always been studied by the major manufacturers around the world. In an era characterized by oil supply problems and ever greater air pollution problems, vehicle fuel consumption has become a key factor. In this example, you'll build a regression tree to predict the fuel consumption of vehicles based on certain characteristics.

`data(mtcars)`

The dataset is contained in the datasets package; to load it, use the data() function. To display a compact summary of the dataset simply type:

`str(mtcars)`

The results are shown as follows:

```> str(mtcars)
'data.frame':  32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...```

You have thus confirmed that these are 11 numeric variables with 32 observations. To extract more information, use the summary() function:

```> summary(mtcars)
mpg             cyl             disp             hp
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
Median :19.20   Median :6.000   Median :196.3   Median :123.0
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
drat             wt             qsec             vs
Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
Median :3.695   Median :3.325   Median :17.71   Median :0.0000
Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
am              gear            carb
Min.   :0.0000  Min.   :3.000   Min.   :1.000
1st Qu.:0.0000  1st Qu.:3.000   1st Qu.:2.000
Median :0.0000   Median :4.000  Median :2.000
Mean   :0.4062   Mean   :3.688   Mean   :2.812
3rd Qu.:1.0000  3rd Qu.:4.000   3rd Qu.:4.000
Max.   :1.0000   Max.   :5.000   Max.   :8.000```

Before starting with data analysis, conduct an exploratory analysis to understand how the data is distributed and extract preliminary knowledge. First, try to find out whether the variables are related to each other. You can do this using the pairs() function to create a matrix of sub-axes containing scatter plots of the columns of a matrix. To reduce the number of plots in the matrix, limit your analysis to just four predictors: cylinders, displacement, horsepower, and weight. The target is the mpg variable that contains the miles per gallon of 32 sample cars:

`pairs(mpg~cyl+disp+hp+wt,data=mtcars)`

To specify the response and predictors, the formula argument is used. Each term gives a separate variable in the pairs plot, so terms must be numeric vectors. The response is interpreted as another variable, but not treated specially. The following figure shows a scatter plot matrix.

By observing the plots in the first row, you can see that miles per gallon decreases (that is, fuel consumption increases) as the number of cylinders, the engine displacement, the horsepower, and the weight of the vehicle increase.
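The visual impression can be backed up numerically. This check is my addition to the walkthrough: computing the correlations between mpg and the four chosen predictors confirms the strong negative relationships.

```r
data(mtcars)
# Correlation of mpg with itself and with the four predictors in the plot
round(cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])["mpg", ], 2)
#  mpg   cyl  disp    hp    wt
# 1.00 -0.85 -0.85 -0.78 -0.87
```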

At this point, you can use the tree() function to build the regression tree. First, install the tree package. To install a library that is not present in the initial distribution of R, you must use the install.packages() function. This is the main function used to install packages. It takes a vector of names and a destination library, downloads the packages from the repositories, and installs them.

Now, load the library through the library command:

`library(tree)`

You can use the tree() function that builds a regression tree:

`RTModel <- tree(mpg~.,data = mtcars)`

Only two arguments are passed: a formula and the dataset name. The left-hand side of the formula (the response) should be a numerical vector when a regression tree is fitted. The right-hand side should be a series of numeric variables separated by +; there should be no interaction terms. Both . and - are allowed; regression trees can have offset terms.

Here are the results:

```> RTModel
node), split, n, deviance, yval
      * denotes terminal node

 1) root 32 1126.000 20.09
   2) wt < 2.26 6   44.550 30.07 *
   3) wt > 2.26 26  346.600 17.79
     6) cyl < 7 12   42.120 20.92
      12) cyl < 5 5    5.968 22.58 *
      13) cyl > 5 7   12.680 19.74 *
     7) cyl > 7 14   85.200 15.10
      14) hp < 192.5 7   16.590 16.79 *
      15) hp > 192.5 7   28.830 13.41 *```

These results describe exactly each node in the tree. Information on each node is presented in an indented format. It is used to indicate the tree topology; that is, it indicates the parent and child relationships (also referred to as primary and secondary splits). Also, to denote a terminal node, an asterisk (*) is used.

In the tree sequence, nodes are labeled with unique numbers, generated by the following rule: the child nodes of a node x are always numbered 2*x (left child) and 2*x+1 (right child), and the root node is numbered 1. The following figure explains this rule.

From the analysis of the results, you can see a selection of variables: of the ten available predictors, only three (wt, cyl, and hp) were selected. More information can be obtained from the summary() function:

```> summary(RTModel)

Regression tree:
tree(formula = mpg ~ ., data = mtcars)
Variables actually used in tree construction:
 "wt"  "cyl" "hp"
Number of terminal nodes:  5
Residual mean deviance:  4.023 = 108.6 / 27
Distribution of residuals:
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-4.067  -1.361   0.220   0.000   1.361   3.833```

The output of summary() indicates that only three of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree. Now, you can plot the regression tree:

```plot(RTModel)
text(RTModel)```

The first command plots the regression tree, while the second adds the text on the branches to explain the workflow. The resulting plot is shown in the following figure.

Now look at what the regression tree has returned. The first thing that stands out is a sort of indication of the importance of the variables: the choice of three predictors out of the ten available already tells you that these three are the ones that most affect the fuel consumption of the cars in the dataset.

Now, you can add that the most important predictor is the weight of the vehicle; in fact, a weight below 2.26 (expressed in thousands of pounds, so about 2,260 lbs) leads to a terminal node, which gives a consumption estimate of 30.07 miles per (US) gallon. Immediately after weight come the number of cylinders of the engine and the horsepower.
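Although not shown in this example, the pruning and tree-selection steps described earlier can be sketched with the cv.tree() and prune.tree() functions from the same tree package. The selected size depends on the random cross-validation folds, so fix the seed if you want reproducible results; consider this a sketch rather than part of the original walkthrough.

```r
library(tree)
data(mtcars)
RTModel <- tree(mpg ~ ., data = mtcars)

set.seed(1)                          # cross-validation is random; fix the seed
cvResult <- cv.tree(RTModel)         # cross-validated deviance for each subtree size
bestSize <- cvResult$size[which.min(cvResult$dev)]
pruned   <- prune.tree(RTModel, best = bestSize)
plot(pruned)
text(pruned)
```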

If you found this article interesting, you can explore Giuseppe Ciaburro’s Regression Analysis with R to build effective regression models in R and extract valuable insights from real data. The book gives you a rundown of regression analysis, explaining the process from scratch.

# Working with Date Objects in R

Learn how to work with date objects in R in this tutorial by Kuntal Ganguly, a big data analytics engineer focused on building large-scale, data-driven systems using big data frameworks and machine learning.

The base R package provides date functionality. This article will show you several date-related operations in R. You’ll only be using features from the base package and not from any external data. Therefore, you do not need to perform any preparatory steps.

R internally represents dates as the number of days since January 1, 1970. This origin comes from Unix time (also called epoch time), which is defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970. So, for Date objects, zero corresponds to January 1, 1970, and so on. You can convert both positive and negative numbers to dates; negative numbers give dates before January 1, 1970.
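You can verify this representation directly: converting a Date to a number exposes the underlying day count.

```r
# Date objects are day counts from the epoch, January 1, 1970
as.numeric(as.Date("1970-01-01"))   # 0
as.numeric(as.Date("1970-01-02"))   # 1
as.numeric(as.Date("1969-12-31"))   # -1: negative numbers are pre-epoch dates
```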

1. Get started with today’s date:
`> Sys.Date()`
1. Create a date object from a string:
```# Supply year as two digits
# Note correspondence between separators in the date string and the format string
> as.Date("1/1/80", format = "%m/%d/%y")
 "1980-01-01"

# Supply year as 4 digits
# Note uppercase Y below instead of lowercase y as above
> as.Date("1/1/1980", format = "%m/%d/%Y")
 "1980-01-01"

# If you omit the format string, you must give the date as "yyyy/mm/dd" or as "yyyy-mm-dd"
> as.Date("1970/1/1")
 "1970-01-01"
> as.Date("70/1/1")
 "0070-01-01"```
1. Use other options for separators (this example uses hyphens) in the format string, and also see the underlying numeric value:
```> dt <- as.Date("1-1-70", format = "%m-%d-%y")
> as.numeric(dt)
 0```
1. Explore other format string options:
```> as.Date("Jan 15, 2015", format = "%b %d, %Y")
 "2015-01-15"
> as.Date("January 15, 15", format = "%B %d, %y")
 "2015-01-15"```
1. Create dates from numbers by typecasting:
```> dt <- 1000
> class(dt) <- "Date"
> dt                 # 1000 days from 1/1/70
 "1972-09-27"
> dt <- -1000
> class(dt) <- "Date"
> dt                 # 1000 days before 1/1/70
 "1967-04-07"```
1. Create dates directly from numbers by setting the origin date:
```> as.Date(1000, origin = as.Date("1980-03-31"))
 "1982-12-26"
> as.Date(-1000, origin = as.Date("1980-03-31"))
 "1977-07-05"```
1. Examine the date components:
```> dt <- as.Date(1000, origin = as.Date("1980-03-31"))
> dt
 "1982-12-26"
> # Get year as four digits
> format(dt, "%Y")
 "1982"
> # Get the year as a number rather than as character string
> as.numeric(format(dt, "%Y"))
 1982
> # Get year as two digits
> format(dt, "%y")
 "82"
> # Get month
> format(dt, "%m")
 "12"
> as.numeric(format(dt, "%m"))
 12
> # Get month as string
> format(dt, "%b")
 "Dec"
> format(dt, "%B")
 "December"
> months(dt)
 "December"
> weekdays(dt)
 "Sunday"
> quarters(dt)
 "Q4"
> julian(dt)
 4742
attr(,"origin")
 "1970-01-01"
> julian(dt, origin = as.Date("1980-03-31"))
 1000
attr(,"origin")
 "1980-03-31"```

## How this works

Step 1 shows how to get the system date. Steps 2 through 4 show how to create dates from strings. You can see that by specifying the format string appropriately, you can read dates from almost any string representation. You can use any separator as long as you mimic them in the format string. The following table summarizes the formatting options for the components of the date:

| Format specifier | Description |
| --- | --- |
| %d | Day of the month as a number, for example, 15 |
| %m | Month as a number, for example, 10 |
| %b | Abbreviated string representation of the month, for example, Jan |
| %B | Complete string representation of the month, for example, January |
| %y | Year as two digits, for example, 87 |
| %Y | Year as four digits, for example, 2001 |

Step 5 shows how an integer can be typecast as a date. Step 6 shows how to find the date with a specific offset from a given date (origin). Finally, step 7 shows how to examine the individual components of a date object using the format function along with the appropriate format specification (refer to the preceding table) for the desired component.

Step 7 also shows the use of the months, weekdays, and julian functions for getting the month, day of the week, and the Julian date corresponding to a date. If you omit the origin in the julian function, R assumes January 1, 1970, as the origin.
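The equivalence between julian() and plain date subtraction can be checked directly:

```r
dt <- as.Date("1982-12-26")
julian(dt)                              # 4742 days since the default origin, 1970-01-01
as.numeric(dt - as.Date("1970-01-01"))  # 4742: the same count via date subtraction
as.numeric(dt - as.Date("1980-03-31"))  # 1000: matches julian() with an explicit origin
```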

## Operating on date objects

R supports many useful manipulations with date objects, such as date addition and subtraction, and the creation of date sequences. This example shows many of these operations in action. The base R package provides the date functionality, and you do not need any preparatory steps.

1. Perform addition and subtraction of days from date objects:
```> dt <- as.Date("1/1/2001", format = "%m/%d/%Y")
> dt
 "2001-01-01"

> dt + 100                 # Date 100 days from dt
 "2001-04-11"

> dt + 31
 "2001-02-01"```
1. Subtract date objects to find the number of days between two dates:
```> dt1 <- as.Date("1/1/2001", format = "%m/%d/%Y")
> dt2 <- as.Date("2/1/2001", format = "%m/%d/%Y")
> dt1-dt1
Time difference of 0 days
> dt2-dt1
Time difference of 31 days
> dt1-dt2
Time difference of -31 days
> as.numeric(dt2-dt1)
 31```
1. Compare the date objects:
```> dt2 > dt1
 TRUE

> dt2 == dt1
 FALSE```
1. Create date sequences:
```> d1 <- as.Date("1980/1/1")
> d2 <- as.Date("1982/1/1")
> # Specify start date, end date and interval
> seq(d1, d2, "month")
 "1980-01-01" "1980-02-01" "1980-03-01" "1980-04-01"
 "1980-05-01" "1980-06-01" "1980-07-01" "1980-08-01"
 "1980-09-01" "1980-10-01" "1980-11-01" "1980-12-01"
 "1981-01-01" "1981-02-01" "1981-03-01" "1981-04-01"
 "1981-05-01" "1981-06-01" "1981-07-01" "1981-08-01"
 "1981-09-01" "1981-10-01" "1981-11-01" "1981-12-01"
 "1982-01-01"

> d3 <- as.Date("1980/1/5")
> seq(d1, d3, "day")
 "1980-01-01" "1980-01-02" "1980-01-03" "1980-01-04"
 "1980-01-05"

> # more interval options
> seq(d1, d2, "2 months")
 "1980-01-01" "1980-03-01" "1980-05-01" "1980-07-01"
 "1980-09-01" "1980-11-01" "1981-01-01" "1981-03-01"
 "1981-05-01" "1981-07-01" "1981-09-01" "1981-11-01"
 "1982-01-01"

> # Specify start date, interval and sequence length
> seq(from = d1, by = "4 months", length.out = 4 )
 "1980-01-01" "1980-05-01" "1980-09-01" "1981-01-01"```
1. Find a future or past date from a given date, based on an interval:
```> seq(from = d1, by = "3 weeks", length.out = 2)
 "1980-01-01" "1980-01-22"```

## How this works

Step 1 shows how you can add and subtract days from a date to get the resulting date. Step 2 shows how you can find the number of days between two dates through subtraction. The result is a difftime object that you can convert into a number if needed.
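If you need the difference in units other than days, the difftime() function in base R accepts a units argument; a small sketch:

```r
d1 <- as.Date("2001-01-01")
d2 <- as.Date("2001-02-01")
d <- d2 - d1
class(d)                           # "difftime"
as.numeric(d)                      # 31
difftime(d2, d1, units = "weeks")  # the same gap expressed in weeks
```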

Step 3 shows the logical comparison of dates, and step 4 shows two different ways of creating sequences of dates. In one, you specify the from date, the to date, and the fixed interval between the sequence elements as a string. In the other, you specify the from date, the interval, and the number of sequence elements you want. If you use the latter approach, you have to name the arguments.

Finally, Step 5 shows how you can create sequences by specifying the intervals in a flexible manner.

If you found this article interesting, you can explore Kuntal Ganguly’s R Data Analysis Cookbook – Second Edition to put your data analysis skills in R to practical use, with recipes catering to the basic as well as advanced data analysis tasks. This book has over 80 recipes to help you breeze through your data analysis projects using R.

# Two Simple Animation Techniques with R

Author: Omar Trejo Navarro

You know how people say “A picture is worth a thousand words”? Well, an animation can be worth even more! When trying to understand dynamics or illustrate concepts, animations can be very powerful. Lucky for us, R can be easily used to create them. Yes, that’s right! We don’t have to use any external software to create nice animations; we can simply use the tools we already know and love. Get ready to impress your friends and colleagues using them!

In this post I’ll show you two different ways to produce animations: with the animation and gganimate packages. As we will see, they require different ways of thinking about animations, which makes them naturally suited to different scenarios, but you can do most practical animations with either of them. If you want to dig deeper into the subject, I’d recommend looking into the great plotly package as well.

If you find this post interesting, you may want to look into my recent book “R Programming by Example” published by Packt. If you do, please publish your opinion on the book’s page. I would really appreciate it, and every comment can go a long way. Also, if you have any questions or feedback, please don’t hesitate to contact me through my website.

## What will we create?

A common statistics programming exercise is to find out whether the confidence intervals associated with different random samples contain the real value of the parameter used to generate them. In this post we will take on the most common case: random samples from a normal distribution with known variance and unknown mean. This unknown mean is a value we actually know but will pretend not to, in order to find out whether a confidence interval contains it.

As you may know, when creating confidence intervals you must specify a degree of confidence, which determines the quantile of the distribution used to generate the intervals, which in turn determines their width. This is intuitive: the more confident you want to be that a value is within an interval, the wider that interval should be. In this case we will use 95% confidence intervals, meaning that for every 100 confidence intervals we generate, on average, 95 of them should contain the “real” mean value used to generate the random samples (and 5 of them should not).
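The 95% critical value of 1.96 used in the code later in this post, and the 1.64 value for 90% intervals, both come from the normal quantile function, which R exposes as qnorm():

```r
alpha <- 0.05
qnorm(1 - alpha / 2)   # 1.959964: the two-sided 95% critical value
qnorm(1 - 0.10 / 2)    # 1.644854: the corresponding value for 90% intervals
```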

To make sure the code for the animations is clear, we will separate the generation of the random samples and their corresponding confidence intervals from the animations. In the animation, if the confidence interval contains the “real” mean, it will be gray; if it doesn’t, it will be red to make it stand out. Below you can see an image of the animation once everything has been drawn. Note that the animation will be different when you run the code yourself because it depends on the parameters used as well as the samples generated, which are not controlled in this example and will surely be different for you. If you want to get the same animations every time you run the code for an example such as this one, make sure you use set.seed(12345) (with an integer of your preference) before executing the code.

## Generating our confidence intervals

We will treat a couple of variables as globals. SD, MEAN, N_S, and N_CI are global parameters for the standard deviation, the mean, the number of observations in each random sample, and the number of confidence intervals (one per random sample), respectively.

```SD <- 1
MEAN <- 0
N_S <- 100
N_CI <- 50```

We define the confidence_interval() function to use the globals mentioned before to produce a normal random sample, using the rnorm() function, with N_S observations, mean MEAN, and standard deviation SD. Our confidence intervals will be tested against the condition that they contain the MEAN value. Once we get the mean of our current random sample as m, we can compute the boundaries of its confidence interval by adding and subtracting the corresponding critical value, adjusted for the sample size.

If we did not assume we knew the standard deviation, we would have to resort to another formula based on Student's t-distribution instead of the normal distribution. Also, if we wanted 90% confidence intervals, we would use a 1.64 critical value instead; but you probably know all of that, so I'll just focus on the code.

Some of you may not know why we need the three dots (...) in the function signature. They are called an ellipsis, and they are there so that the function can accept an arbitrary number of arguments. We need them because we intend to use our function within an lapply() call, which always passes the current element in the iteration as an argument, and if our confidence_interval() function could not receive it, we would get an error. We could just as easily have used a named parameter, but I think the ellipsis is more elegant than an explicit parameter that is unused and may confuse future readers/users of our code.
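A minimal demonstration of this mechanism (the function name here is my own, for illustration): lapply() passes each element of the sequence as the first argument, and the ellipsis absorbs it silently.

```r
# lapply() always passes the current element; the ellipsis swallows it
ignores_args <- function(...) "called"
unlist(lapply(1:3, ignores_args))   # "called" "called" "called"
```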

```confidence_interval <- function(...) {
  m <- mean(rnorm(N_S, MEAN, SD))
  # Half-width: critical value times the (known) standard deviation,
  # adjusted for the sample size; SD is 1 here, but included for clarity
  low  <- m - 1.96 * SD / sqrt(N_S)
  high <- m + 1.96 * SD / sqrt(N_S)
  return(c(low, high))
}```

The CIS object will contain a list of vectors, where each vector contains the boundaries of one confidence interval. We create it by simply running the confidence_interval() function N_CI times. Since our code is very simple and makes use of globals, we don't need to pass it any parameters. Keep in mind that there is a parameter actually being passed to the confidence_interval() function, namely the current value from the 1:N_CI sequence, but it is caught by the ellipsis and not used.

`CIS <- lapply(1:N_CI, confidence_interval)`

To take a look at our confidence intervals, we can simply print the CIS object. In this case we see that the first confidence interval goes from -0.12 to 0.25, the second from -0.23 to 0.17, and so on. We are going to use these values to generate our animations in the following sections. As you can see, these values correspond to the first three confidence intervals in the image shown above, and all of them contain the MEAN used, which was 0, so they appear in gray.

```> CIS
[[1]]
 -0.1217169  0.2502831

[[2]]
 -0.2287392  0.1732608

[[3]]
 -0.2311497  0.3208503

(...)```

In the following sections, we will see how to generate the actual animations that will show each confidence interval one by one.
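As a quick sanity check before animating (this snippet is my addition, and it repeats the setup so it runs standalone), you can compute the fraction of intervals that cover MEAN; it should be close to 0.95.

```r
# Standalone version of the setup above, plus a coverage check
SD <- 1; MEAN <- 0; N_S <- 100; N_CI <- 50
confidence_interval <- function(...) {
  m <- mean(rnorm(N_S, MEAN, SD))
  c(m - 1.96 * SD / sqrt(N_S), m + 1.96 * SD / sqrt(N_S))
}
set.seed(12345)
CIS <- lapply(1:N_CI, confidence_interval)

# TRUE when an interval contains the true mean
covers <- sapply(CIS, function(ci) ci[1] <= MEAN && MEAN <= ci[2])
mean(covers)   # proportion of the 50 intervals containing the true mean
```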

## Using the animation package

The first animation we generate will be with the animation package, which was developed by Yihui Xie, a Statistics PhD from Iowa State University. It works by capturing different frames of a graph. Before we look at the code for our animation, here's a simple example that creates 10 frames using a for loop, where each iteration plots a uniform random sample. To generate animations with this package, we create each frame separately, and the saveGIF() function, which receives an expression with our code inside, takes those images and stitches them together into an animation.

Given this information, you may realize that if you want a smoother animation, you will probably want to create more frames and reduce the amount of movement between them, which would be correct. Unfortunately, you may run into problems when increasing the number of frames, as I have. You'll have to explore what fits your use case.

```library(animation)
saveGIF({
  for (i in 1:10) plot(runif(10), ylim = 0:1)
})```

Now that you understand the basic mechanics of how animations are generated with this package, we will look into the code for our confidence intervals animation. The first thing to note is that we import the file we created in the previous section (animations-with-R-base.R) to make sure we have the CIS, MEAN, and N_CI objects available in memory. Next we load the animation package and specify that we want 0.25 seconds between frames, which is equivalent to 4 FPS (frames per second). Since we will create an animation with 50 confidence intervals, the total time for our animation will be 12.5 seconds.

We send our expression to the saveGIF() function, in which we use a for loop as before. For each confidence interval, we create an empty plot (note the 0 in the first parameter) with N_CI values on the x-axis and a y-axis range that lets us properly see the min/max values in our confidence intervals ([-0.6, 0.6] in this case).

We also provide some labels, as well as a horizontal gray line at the MEAN level, which is 0. Next we extract the necessary values for each frame. These values are composed of all confidence interval values up to i. Since i can be thought of as the frame we are currently working on, this means we get the confidence intervals from the first up to the one for the current frame. This produces an effect where intervals appear to be added one by one to the animation, but in reality we are getting the first i elements each time, not just adding the latest one. The actual extraction is done with a combination of the lapply() and unlist() functions, plus an anonymous function that simply returns the corresponding boundary from each confidence interval's vector. This is done so that we have the boundary values in two separate vectors, which is what the segments() function requires to actually draw the lines in the graph.

The only thing missing is the color. Here we apply a similar extraction mechanism to the one described in the previous paragraph, but instead of simply returning a value from a vector, we check whether all values are larger or all smaller than MEAN, in which case the confidence interval does not contain the MEAN value and "red" is returned; otherwise, "gray" is returned. The color object ends up with one element for each element in the x object (the other objects there also contain the same number of values). Finally, we specify a slightly thicker line width with lwd = 3.

```library(animation)
ani.options(interval = .25)
saveGIF({
  for (i in 1:N_CI) {
    plot(
      0,
      xlim = c(1, N_CI),
      ylim = c(-0.6, 0.6),
      xlab = "Sample",
      ylab = "Confidence Interval",
      main = "Confidence Interval Animation"
    )
    abline(h = MEAN, col = "gray", lwd = 2)
    x <- 1:i
    y1 <- unlist(lapply(CIS[1:i], function(ci) ci[1]))
    y2 <- unlist(lapply(CIS[1:i], function(ci) ci[2]))
    color <- unlist(lapply(CIS[1:i], function(ci) {
      if (all(ci > MEAN) || all(ci < MEAN)) {
        return("red")
      }
      return("gray")
    }))
    segments(x, y1, x, y2, col = color, lwd = 3)
  }
})```

When we execute the previous code, we can see our animation, which is shown below. Pretty cool, isn’t it? You may want to look into the package’s examples to find more interesting use cases. You should also know that you can export animations as HTML, SWF, and even LaTeX files, if you so desire. Personally, I’ve found GIF portable enough to be used wherever I need it.

## Using the gganimate package

Now we are going to create a similar animation, but this time with the gganimate package, which was developed by David Robinson, Chief Data Scientist at DataCamp, a Computational Biology PhD from Princeton, and a book author. This package was designed to work with ggplot2 and uses data frames as input. It extends the aes() function (aesthetics), used by the ggplot() function, to accept frame and cumulative parameters. These parameters indicate which observations should be used in which frames and whether the process should be cumulative. If this last parameter is set to TRUE, we don't need to manually accumulate objects in the graph as we did with the animation package.

As you can see, we need to import both the ggplot2 and gganimate packages. Then we create our data frame with a structure similar to the separate vectors in the previous example. Naturally, we have one observation for each frame in our animation. Sometimes this structure will not come naturally and you will need to do some data preparation, but this case is simple in that regard.

Next we create a graph as we normally would with the ggplot() function, but we specify two new parameters in the aes() function: frame and cumulative. The frame parameter lets gganimate know which variable indicates the data that should be used for each frame, and cumulative produces the behavior mentioned above. Everything else is standard ggplot2 code. As we did before, we add a horizontal segment, as well as a segment for each confidence interval; we specify the colors and remove the legend because it’s not useful in this case.

```r
library(ggplot2)
library(gganimate)

# As before, CIS is assumed to be a list of two-element vectors
# (lower and upper bound of each confidence interval).
data <- data.frame(
  x1 = 1:N_CI,
  x2 = 1:N_CI,
  y1 = unlist(lapply(CIS, function(ci) ci[1])),  # lower bounds
  y2 = unlist(lapply(CIS, function(ci) ci[2])),  # upper bounds
  color = unlist(lapply(CIS, function(ci) {
    if (all(ci > MEAN) || all(ci < MEAN)) {
      return("red")
    }
    return("gray")
  }))
)
plot <- ggplot(data, aes(color = color, frame = x1, cumulative = TRUE)) +
  geom_segment(
    aes_string(x = 0, xend = 50, y = 0, yend = 0), color = "gray"
  ) +
  geom_segment(
    aes(x = x1, xend = x2, y = y1, yend = y2), size = 1.5
  ) +
  scale_color_manual(values = c("gray", "red")) +
  theme(legend.position = "none")
```

If you want to see the final frame of the animation (after all the accumulation has been done), you can do so by simply printing the object (as you would with any ggplot2 graph). This is very helpful because you can verify that the final result is what you expect without producing the animation.

`print(plot)`

To actually see the animation, you need to pass the plot object to the gganimate() function which will generate it and show it to you. The result is shown below.

`gganimate(plot)`

## Summary

In this post we have seen two different ways to create animations within R. They operate in slightly different ways, and each may end up being more intuitive in different situations. I prefer working with gganimate rather than with animation because it uses tools that I normally find very intuitive (ggplot2 and data frames), but also because it doesn’t require an explicit for loop to create each frame; you can simply prepare the data and create a normal ggplot() graph specifying which observations should be used in each frame. I find this approach very intuitive, but I know people who find the explicit looping approach more natural. Which one do you prefer? Let me know in the comments!

Now that you have the fundamentals for animations with R, imagine all the possibilities. All those times that your colleagues were confused about what you were thinking, and you said to yourself, “If only I could show them this with an animation”: now you can.

### R Programming by Example

R is a high-level statistical language and is widely used among statisticians and data miners to develop analytical applications. Given the obvious advantages that R brings to the table, it has become imperative for data scientists to learn the essentials of the language.

R Programming by Example is a hands-on guide that helps you develop a strong fundamental base in R by taking you through a series of illustrations and examples. Written by Omar Trejo Navarro, the book starts with the basic concepts and gradually progresses towards more advanced concepts to give you a holistic view of R. Omar is a well-respected data consultant with expertise in applied mathematics and economics.

If you are an aspiring data science professional or statistician and want to learn more about R programming in a practical and engaging manner, R Programming by Example is the book for you!

# R variable types

In this post we will deepen the concept of variables in the R environment by explaining the different types of variables. The R language provides two types of variables:

1) global variables;

2) local variables.

As you can guess, global variables are accessible globally within the program, while local variables have meaning only within the scope they belong to, being visible only inside the function in which they are initialized.
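As a quick, hypothetical illustration of the two scopes (the function name `f` and the variable names are ours, not from the text), consider:

```r
x <- 10            # global variable: visible throughout the program

f <- function() {
  y <- 5           # local variable: exists only while f() runs
  x + y            # the global x can be read inside the function
}

f()                # 15
exists("y")        # FALSE: y disappeared when f() returned
```

Note that inside a function a plain `<-` always creates a new local variable; the superassignment operator `<<-` would be needed to modify the global one instead.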

For most compilers, a variable name can contain up to thirty-one characters, so a sufficiently descriptive name can be used for a variable; in R no such limit is specified. The choice of name is of fundamental importance in making the code readable; this is because readable code will be easily maintained even by people other than the programmer who created it.

We have talked about initialization of the variable as an operation to create the variable; let’s see then a trivial example:

`> a <- 1`

In this instruction the assignment operator (<-) was used, with the meaning of assigning the value 1 to the memory location identified by the name a. The type of the variable is established during the initialization phase; at that point it is decided whether it holds a text string, a Boolean value (TRUE/FALSE), a decimal number, and so on.
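For example (a minimal sketch; the variable names below are illustrative), the same assignment operator fixes a different type depending on the value we give:

```r
a <- 1          # a numeric value
s <- "hello"    # a text string
flag <- TRUE    # a Boolean (logical) value

class(a)     # "numeric"
class(s)     # "character"
class(flag)  # "logical"
```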

# Variables and expressions in R

Variables and expressions in R are treated in a modern and efficient way: while most programming languages require a declaration of the variables used in the program, made in the initial part before the executive section, in R none of this is required. The language does not require the declaration of variables; their type and size are decided when they are initialized.

The term variable refers to a type of data whose value can change during the execution of the program. However, it is possible to assign an initial value to it, in which case we talk about initializing the variable. The initialization phase is of fundamental importance because it is the moment in which the variable is created, and it coincides with the moment in which a given value is associated with it.

Unlike in so-called compiled languages, this procedure can be performed anywhere in the script, and the same variable can take on values of different types over time.
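A short sketch of this behavior (the variable name is ours): the same variable can be re-initialized later in the script with a value of a different type.

```r
v <- 42
class(v)          # "numeric"

v <- "forty-two"  # re-initialization anywhere in the script
class(v)          # "character": the type followed the new value
```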

When the interpreter encounters a variable, it deposits the relative value in a memory location, and whenever a call to that variable appears in the program, it will refer to this location. It is a good programming rule to use names that allow us to refer unambiguously to the specific memory locations where the relevant data has been stored.

# R names

In this post (The names in R) we analyze the rules to follow to correctly choose the names of constants, variables, methods, classes and modules, which represent the essential elements we will work with in this environment.

A name in R can consist of uppercase letters, lowercase letters, digits and the dot symbol (.), in any combination. The lowercase characters are the letters of the alphabet from a to z, the uppercase characters are the letters from A to Z, and the digits run from 0 to 9. The number of characters that make up the name is not limited.

Here are some suggestions given in the R style guide provided by Google on how to correctly name objects (names in R):

• Never use underscores (_) or dashes (-) to identify an object in the R environment.
• Identifiers must be named according to the following conventions.
• The preferred form for variable names is all lowercase letters with words separated by dots (variable.name), but the variableValue form is also accepted.
• Function names have an initial capital letter and no dots (FunctionName).
• Constants are named in the same way as functions, but with an initial k.
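Putting those conventions together in a small example (all the names below are made up for illustration):

```r
# variable: all lowercase, words separated by dots (preferred form)
sample.count <- 10

# accepted alternative form
sampleCount <- 10

# function: initial capital letter, no dot
ComputeDouble <- function(x) {
  x * 2
}

# constant: named like a function, but prefixed with k
kMaxSamples <- 100

ComputeDouble(sample.count)  # 20
```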

# Code indentation in R

Even if the structure of the R language provides particular delimiters for some program blocks, code indentation in R is still useful for identifying them.

In this regard, it should be recalled that code indentation refers to the technique, used in programming, by which program blocks are highlighted by inserting a certain amount of empty space at the beginning of a line of text, with the aim of increasing readability.

Although, as already mentioned, R provides appropriate delimiters for some language structures, we will use indentation to indicate nested blocks; for this purpose it is possible to use either a tab or an arbitrary number of white spaces.

In using this technique it is necessary to remember simple recommendations:

• the number of spaces to use is variable;
• all instructions in the program block must have the same number of indentation spaces.

In this context we will use the convention that provides for the exclusive use of two spaces to identify a new block, leaving out the use of the tab.
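A small sketch of that convention (the function below is hypothetical): each nested block is shifted by exactly two further spaces.

```r
Classify <- function(x) {
  if (x > 0) {
    result <- "positive"   # two spaces per nesting level
  } else {
    if (x < 0) {
      result <- "negative"
    } else {
      result <- "zero"
    }
  }
  result
}

Classify(3)   # "positive"
Classify(-1)  # "negative"
```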

# R script editor

In this post we will analyze why we should use one of the R script editors available online. To program with R, we can use any text editor and a simple command-line interface. Both of these tools are already present on any operating system, so, if we want, we can skip this step.

This is because, when a programmer writes a program, he does so in a plain text editor window: to write programs, fonts, colors and the graphic appearance in general are irrelevant; indeed, they can make the programmer’s work more difficult.

This is the reason why software development environments do not use complex word processing programs, which are largely used by writers, but rather simple text editors (such as Notepad in Windows, or vi and emacs in the Linux environment).

Instead of complicated visual text management options, these editors provide advanced text processing features, such as fast text-based navigation procedures, word searches and substitutions within the file and across external files, recognition of the keywords of the programming language with the possibility of highlighting them in a color different from the rest of the text, and finally indentation of the text.

# How to create a script in R

To create a script in R just use the File menu through the following sequence of commands:

`File => New script`

This will open an empty script; it will be up to us now to populate it with worthwhile code. Once we have entered some commands in the script window, they will need to be run in the console. Of course you can copy-paste, as usual, but it is decidedly quicker and more professional to select the code you want to run with the mouse and press Ctrl + R. Alternatively, you can also right-click in the window and select the Execute line or selection item.

From this example comes a first consideration: typing the commands directly into the R shell is not really the best solution. A more effective way to interact with the shell is to create a script in R. With this method, you write the code in a separate window and then execute it in the console, so that if you need to save the code or run it several times you will not have to type it all again.
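As a minimal sketch of this workflow (the file and the message are invented for illustration), we can write a one-line script to a file and then execute it in the console with source():

```r
script <- tempfile(fileext = ".R")  # stand-in for a saved script file
writeLines('msg <- "hello from a script"', script)

source(script)  # runs the script's code in the current session
msg             # "hello from a script"
```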