Python Machine Learning Cookbook Second Edition

Over 100 recipes to progress from smart data analytics to deep learning using real-world datasets.

The popular Python Machine Learning Cookbook, Second Edition, will enable you to adopt a fresh approach to dealing with real-world machine learning and deep learning tasks.
With the help of over 100 recipes, you will learn to build powerful machine learning applications using modern libraries from the Python ecosystem. The book will also guide you on how to implement various machine learning algorithms for classification, clustering, and recommendation engines, using a recipe-based approach. With an emphasis on practical solutions, dedicated sections in the book will help you to apply supervised and unsupervised learning techniques to real-world problems. Toward the concluding chapters, you will get to grips with recipes that teach you advanced techniques for fields including reinforcement learning, deep neural networks, and automated machine learning.

By the end of this book, you will be equipped, through real-world examples, with the skills you need to apply machine learning techniques, and will be able to leverage the full capabilities of the Python ecosystem.



Understanding Regression Trees

Learn about regression trees in this tutorial by Giuseppe Ciaburro, who holds a Ph.D. in environmental technical physics and has over 15 years of experience programming in Python, R, and MATLAB in the fields of combustion, acoustics, and noise control.

Decision trees are used to predict a response (class y) from several input variables: x1, x2,…,xn. If y is a continuous response, it’s called a regression tree; if y is categorical, it’s called a classification tree. That’s why these methods are often called Classification and Regression Trees (CART). The algorithm checks the value of an input (xi) at every node of the tree and continues to the left or right branch based on the (binary) answer. When you reach a leaf, you will find the prediction.

The algorithm starts with the data grouped into a single node (the root node) and performs an exhaustive search of all possible subdivisions at every step. At each step, the best subdivision (the one that produces branches that are as homogeneous as possible) is chosen.

In regression trees, you try to partition the data space into parts small enough that a simple (yet different) model can be applied to each one. The non-leaf part of the tree is just the procedure that determines, for each data point x, which model will be used to predict it.

A regression tree is formed by a series of nodes that split the root branch into two child branches. Such subdivision continues to cascade. Each new branch can, then, go to another node or remain a leaf with the predicted value.

Starting from the whole dataset (root), the algorithm creates the tree through the following procedure:

  1. Identify the best predictor X1 with which to split the dataset, and the best split value s1. The left-hand branch is the set of observations where X1 is below s1, while the right-hand branch comprises the observations in which X1 is greater than or equal to s1.
  2. This operation is then recursively executed again (independently) for every branch until there is no possibility of division.
  3. When the divisions are completed, a leaf is created, which indicates the output values.
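The recursive procedure above hinges on one operation: choosing the split that makes the branches as homogeneous as possible. As a rough illustration, here is a hedged Python sketch of that step for a single numeric predictor, scoring each candidate threshold by the total within-branch sum of squared deviations (the function name and approach are illustrative only, not the CART reference implementation):

```python
def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split of y by x.

    For every candidate threshold, observations go left if x < threshold
    and right otherwise; the split minimizing the combined sum of squared
    deviations from each branch mean wins.
    """
    def sse(values):
        # Sum of squared deviations from the mean of `values`
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = (None, float("inf"))
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        if not left or not right:
            continue  # skip degenerate splits with an empty branch
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best
```

For example, best_split([1, 2, 10, 11], [1.0, 1.1, 5.0, 5.2]) picks the threshold 10, cleanly separating the low responses from the high ones.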

Suppose you have a response variable with only two continuous predictors (X1 and X2) and four split values (s1, s2, s3, and s4). The following figure proposes a way to represent the whole dataset graphically:

The goal of a regression tree is to encapsulate the whole dataset in the smallest possible tree. To minimize the tree size, the simplest possible explanation for a set of observations is preferred over other explanations. All this is justified by the fact that small trees are much easier to comprehend than large trees.

You saw how the regression tree algorithm works. These steps can be summarized in the following processes:

  • Splitting: The dataset is partitioned into subsets. The split operation is based on a set of rules, for example, minimizing the sum of squares within each subset. Each leaf node contains a small subset of the observations. Splitting continues until a leaf node is reached.
  • Pruning: In this process, the tree branches are shortened. The tree is reduced by transforming some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Care must be taken, as the lower branches can be strongly influenced by abnormal values. Pruning lets you evaluate a sequence of progressively smaller trees and mitigate this problem; a simpler tree often avoids overfitting.
  • Tree selection: Finally, the smallest tree that matches the data is selected. This process is executed by choosing the tree that produces the lowest cross-validated error.

To fit a regression tree in R, you can use the tree() function implemented in the tree package. In this package, a tree is grown via binary recursive partitioning, using the response in the specified formula and choosing splits from the terms of the right-hand side. Numeric variables are divided into X < a and X > a. The split that maximizes the reduction in impurity is chosen, the dataset is split, and the process is repeated. Splitting continues until the terminal nodes are too small or too few to be split. Take a look at the following table for basic information on this package:

Package: tree
Date: January 21, 2016
Version: 1.0-37
Title: Classification and Regression Trees
Author: Brian Ripley

To perform a regression tree example, begin with the data. Use the mtcars dataset contained in the datasets package. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a data frame with 32 observations on the following 11 variables:

  • mpg: Miles per gallon
  • cyl: Number of cylinders
  • disp: Engine displacement (cubic inches)
  • hp: Engine horsepower
  • drat: Rear axle ratio
  • wt: Weight (1,000 lbs)
  • qsec: 1/4 mile time
  • vs: Engine shape (0 = V-shaped, 1 = straight)
  • am: Transmission (0 = automatic, 1 = manual)
  • gear: Number of forward gears
  • carb: Number of carburetors

The fuel consumption of vehicles has always been studied by the major manufacturers around the world. In an era characterized by fuel supply problems and ever greater air pollution, the fuel consumption of vehicles has become a key factor. In this example, you'll build a regression tree with the purpose of predicting the fuel consumption of vehicles according to certain characteristics.

The analysis begins by loading the dataset:

> data(mtcars)

The dataset is contained in the datasets package; to load it, use the data() function. To display a compact summary of the dataset, simply type:


The results are shown as follows:

> str(mtcars)
'data.frame':  32 obs. of  11 variables: 
$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... 
$ cyl : num  6 6 4 6 8 6 8 4 4 6 ... 
$ disp: num  160 160 108 258 360 ... 
$ hp  : num  110 110 93 110 175 105 245 62 95 123 ... 
$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... 
$ wt  : num  2.62 2.88 2.32 3.21 3.44 ... 
$ qsec: num  16.5 17 18.6 19.4 17 ... 
$ vs  : num  0 0 1 1 0 1 0 1 1 1 ... 
$ am  : num  1 1 1 0 0 0 0 0 0 0 ... 
$ gear: num  4 4 4 3 3 3 3 4 4 4 ... 
$ carb: num  4 4 1 1 2 1 4 2 2 4 ...

You have thus confirmed that these are 11 numeric variables with 32 observations. To extract more information, use the summary() function:

> summary(mtcars)      
      mpg             cyl             disp             hp       
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
Median :19.20   Median :6.000   Median :196.3   Median :123.0  
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
     drat             wt             qsec             vs        
Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
      am              gear            carb      
Min.   :0.0000   Min.   :3.000   Min.   :1.000  
1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
Median :0.0000   Median :4.000   Median :2.000  
Mean   :0.4062   Mean   :3.688   Mean   :2.812  
3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
Max.   :1.0000   Max.   :5.000   Max.   :8.000

Before starting with the data analysis, conduct an exploratory analysis to understand how the data is distributed and to extract preliminary knowledge. First, try to find out whether the variables are related to each other. You can do this using the pairs() function, which creates a matrix of sub-axes containing scatter plots of the columns of a matrix. To reduce the number of plots in the matrix, limit the analysis to just four predictors: cylinders, displacement, horsepower, and weight. The target is the mpg variable, which contains the miles per gallon of the 32 sample cars:

> pairs(mpg ~ cyl + disp + hp + wt, data = mtcars)

To specify the response and predictors, the formula argument is used. Each term gives a separate variable in the pairs plot, so terms must be numeric vectors. The response is interpreted as another variable, but not treated specially. The following figure shows the scatter plot matrix:

By observing the plots in the first row, you can see that mpg decreases as the number of cylinders, the engine displacement, the horsepower, and the weight of the vehicle increase; in other words, fuel consumption rises.

At this point, you can use the tree() function to build the regression tree. First, install the tree package. To install a library that is not present in the initial distribution of R, you must use the install.packages() function. This is the main function used to install packages. It takes a vector of names and a destination library, downloads the packages from the repositories, and installs them:

> install.packages("tree")

Now, load the library through the library() command:

> library(tree)

You can then use the tree() function, which builds a regression tree:

RTModel <- tree(mpg ~ ., data = mtcars)

Only two arguments are passed: a formula and the dataset name. The left-hand side of the formula (the response) should be a numerical vector when a regression tree is fitted. The right-hand side should be a series of numeric variables separated by +; there should be no interaction terms. Both . and - are allowed; regression trees can have offset terms.

Here are the results:

> RTModel
node), split, n, deviance, yval
      * denotes terminal node  

1) root 32 1126.000 20.09  
  2) wt < 2.26 6   44.550 30.07 *
  3) wt > 2.26 26  346.600 17.79  
    6) cyl < 7 12   42.120 20.92  
     12) cyl < 5 5    5.968 22.58 *
     13) cyl > 5 7   12.680 19.74 *
    7) cyl > 7 14   85.200 15.10  
     14) hp < 192.5 7   16.590 16.79 *
     15) hp > 192.5 7   28.830 13.41 *

These results describe each node in the tree exactly. Information on each node is presented in an indented format, which indicates the tree topology, that is, the parent and child relationships (also referred to as primary and secondary splits). A terminal node is denoted by an asterisk (*).

In the tree sequence, nodes are labeled with unique numbers. These numbers are generated by the following formula: the child nodes of a node x are always numbered 2*x (left child) and 2*x+1 (right child). The root node is numbered as one. The following figure explains this rule:
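The numbering rule is small enough to sketch in a few lines of code (shown here in Python purely for illustration; the helper names are mine, not part of the tree package):

```python
# Node numbering in the printed tree: the children of node x are
# 2*x (left) and 2*x + 1 (right); the root is node 1.
def children(x):
    return 2 * x, 2 * x + 1

def parent(x):
    return x // 2

print(children(1))   # the two branches of the root: nodes 2 and 3
print(children(3))   # the children of node 3: nodes 6 and 7
print(parent(14))    # node 14 hangs off node 7
```

This is exactly the pattern visible in the output above, where node 3 splits into nodes 6 and 7, and node 7 into nodes 14 and 15.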

From the analysis of the results, you can see that variable selection has taken place; in fact, of the ten available predictors, only three (wt, cyl, and hp) were selected. More information can be obtained from the summary() function:

> summary(RTModel) 

Regression tree:
tree(formula = mpg ~ ., data = mtcars)
Variables actually used in tree construction:
[1] "wt"  "cyl" "hp"
Number of terminal nodes:  5
Residual mean deviance:  4.023 = 108.6 / 27
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -4.067  -1.361   0.220   0.000   1.361   3.833

The output of summary() indicates that only three of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree. Now, you can plot the regression tree:

> plot(RTModel)
> text(RTModel)

The first command plots the regression tree, while the second adds the labels on the branches to explain the workflow. The resulting plot is shown in the following figure:

Now look at what the regression tree has returned. The first thing that stands out is an indication of variable importance: the choice of three predictors from the ten available variables already tells you that these three are the ones that most affect the fuel consumption of the cars in the dataset.

You can add that the most important predictor is the weight of the vehicle; in fact, a weight below 2.26 (that is, 2,260 lbs, since wt is expressed in thousands of pounds) leads to a terminal node, which gives a consumption estimate of 30.07 miles per (US) gallon. Immediately after, you find the number of engine cylinders and the horsepower.

If you found this article interesting, you can explore Giuseppe Ciaburro's Regression Analysis with R to build effective regression models in R and extract valuable insights from real data. The book gives you a rundown of regression analysis, walking you through the process from scratch.

Working with Date Objects in R

Learn how to work with date objects in R in this tutorial by Kuntal Ganguly, a big data analytics engineer focused on building large-scale, data-driven systems using big data frameworks and machine learning.

The base R package provides date functionality. This article will show you several date-related operations in R. You'll only be using features from the base package, not from any external packages, so you do not need to perform any preparatory steps.

R internally represents dates as the number of days since January 1, 1970. This is analogous to Unix time (or epoch time), which is defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, except that R's Date class counts days rather than seconds. So, zero corresponds to January 1, 1970, and so on. You can convert both positive and negative numbers to dates; negative numbers give dates before January 1, 1970.

  1. Get started with today’s date:
> Sys.Date()
  2. Create a date object from a string:
# Supply year as two digits 

# Note correspondence between separators in the date string and the format string 

> as.Date("1/1/80", format = "%m/%d/%y")  
[1] "1980-01-01"  

# Supply year as 4 digits 

# Note uppercase Y below instead of lowercase y as above 

> as.Date("1/1/1980", format = "%m/%d/%Y") 
[1] "1980-01-01" 

# If you omit the format string, you must give the date as "yyyy/mm/dd" or as "yyyy-mm-dd" 

> as.Date("1970/1/1")
[1] "1970-01-01"
> as.Date("70/1/1") 
[1] "0070-01-01"
  3. Use other options for separators (this example uses hyphens) in the format string, and also see the underlying numeric value:
> dt <- as.Date("1-1-70", format = "%m-%d-%y") 
> as.numeric(dt)   
[1] 0
  4. Explore other format string options:
> as.Date("Jan 15, 2015", format = "%b %d, %Y") 
[1] "2015-01-15"  
> as.Date("January 15, 15", format = "%B %d, %y") 
[1] "2015-01-15"
  5. Create dates from numbers by typecasting:
> dt <- 1000 
> class(dt) <- "Date" 
> dt                 # 1000 days from 1/1/70 
[1] "1972-09-27"  
> dt <- -1000 
> class(dt) <- "Date" 
> dt                 # 1000 days before 1/1/70 
[1] "1967-04-07"
  6. Create dates directly from numbers by setting the origin date:
> as.Date(1000, origin = as.Date("1980-03-31")) 
[1] "1982-12-26"  
> as.Date(-1000, origin = as.Date("1980-03-31")) 
[1] "1977-07-05"
  7. Examine the date components:
> dt <- as.Date(1000, origin = as.Date("1980-03-31")) 
> dt 
[1] "1982-12-26"  
> # Get year as four digits 
> format(dt, "%Y") 
[1] "1982"  
> # Get the year as a number rather than as character string 
> as.numeric(format(dt, "%Y")) 
[1] 1982  
> # Get year as two digits 
> format(dt, "%y") 
[1] "82"  
> # Get month 
> format(dt, "%m")  
[1] "12"  
> as.numeric(format(dt, "%m"))  
[1] 12  
> # Get month as string 
> format(dt, "%b") 
[1] "Dec"  
> format(dt, "%B") 
[1] "December"  
> months(dt)  
[1] "December"  
> weekdays(dt) 
[1] "Sunday"  
> quarters(dt) 
[1] "Q4"  
> julian(dt) 
[1] 4742 
attr(,"origin")
[1] "1970-01-01"  
> julian(dt, origin = as.Date("1980-03-31")) 
[1] 1000 
attr(,"origin")
[1] "1980-03-31"

How this works

Step 1 shows how to get the system date. Steps 2 through 4 show how to create dates from strings. You can see that by specifying the format string appropriately, you can read dates from almost any string representation. You can use any separator as long as you mimic them in the format string. The following table summarizes the formatting options for the components of the date:

Format Specifier Description
%d Day of month as a number, for example, 15
%m Month as a number, for example, 10
%b Abbreviated string representation of a month, for example, Jan
%B Complete string representation of a month, for example, January
%y Year as two digits, for example, 87
%Y Year as four digits, for example, 2001
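Incidentally, these format specifiers are not unique to R: Python's datetime module uses the same % codes, so the table above carries over directly (note that month-name parsing with %b and %B assumes an English locale):

```python
from datetime import datetime

# Parse a date string with the same specifiers used in the R examples
dt = datetime.strptime("Jan 15, 2015", "%b %d, %Y")

print(dt.strftime("%Y-%m-%d"))  # 2015-01-15
print(dt.strftime("%B"))        # January
print(dt.strftime("%y"))        # 15
```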

Step 5 shows how an integer can be typecast as a date. Step 6 shows how to find the date with a specific offset from a given date (origin). Finally, step 7 shows how to examine the individual components of a date object using the format function along with the appropriate format specification (refer to the preceding table) for the desired component.

Step 7 also shows the use of the months(), weekdays(), and julian() functions for getting the month, the day of the week, and the Julian date corresponding to a date. If you omit the origin in the julian() function, R assumes January 1, 1970, as the origin.

Operating on date objects

R supports many useful manipulations with date objects, such as date addition and subtraction, and the creation of date sequences. This example shows many of these operations in action. The base R package provides the date functionality, and you do not need any preparatory steps.

  1. Perform addition and subtraction of days from date objects:
> dt <- as.Date("1/1/2001", format = "%m/%d/%Y") 
> dt 
[1] "2001-01-01"  

> dt + 100                 # Date 100 days from dt
[1] "2001-04-11"  

> dt + 31 
[1] "2001-02-01"
  2. Subtract date objects to find the number of days between two dates:
> dt1 <- as.Date("1/1/2001", format = "%m/%d/%Y") 
> dt2 <- as.Date("2/1/2001", format = "%m/%d/%Y") 
> dt1 - dt1
Time difference of 0 days
> dt2 - dt1
Time difference of 31 days
> dt1 - dt2
Time difference of -31 days
> as.numeric(dt2 - dt1)
[1] 31
  3. Compare the date objects:
> dt2 > dt1 
[1] TRUE  

> dt2 == dt1 
[1] FALSE
  4. Create date sequences:
> d1 <- as.Date("1980/1/1") 
> d2 <- as.Date("1982/1/1") 
> # Specify start date, end date and interval 
> seq(d1, d2, "month")  
[1] "1980-01-01" "1980-02-01" "1980-03-01" "1980-04-01"  
[5] "1980-05-01" "1980-06-01" "1980-07-01" "1980-08-01"  
[9] "1980-09-01" "1980-10-01" "1980-11-01" "1980-12-01" 
[13] "1981-01-01" "1981-02-01" "1981-03-01" "1981-04-01" 
[17] "1981-05-01" "1981-06-01" "1981-07-01" "1981-08-01" 
[21] "1981-09-01" "1981-10-01" "1981-11-01" "1981-12-01" 
[25] "1982-01-01"  

> d3 <- as.Date("1980/1/5") 
> seq(d1, d3, "day") 
[1] "1980-01-01" "1980-01-02" "1980-01-03" "1980-01-04" 
[5] "1980-01-05"  

> # more interval options 
> seq(d1, d2, "2 months")  
[1] "1980-01-01" "1980-03-01" "1980-05-01" "1980-07-01"  
[5] "1980-09-01" "1980-11-01" "1981-01-01" "1981-03-01"  
[9] "1981-05-01" "1981-07-01" "1981-09-01" "1981-11-01" 
[13] "1982-01-01"  

> # Specify start date, interval and sequence length 
> seq(from = d1, by = "4 months", length.out = 4 ) 
[1] "1980-01-01" "1980-05-01" "1980-09-01" "1981-01-01"
  5. Find a future or past date from a given date, based on an interval:
> seq(from = d1, by = "3 weeks", length.out = 2)
[1] "1980-01-01" "1980-01-22"

How this works

Step 1 shows how you can add and subtract days from a date to get the resulting date. Step 2 shows how you can find the number of days between two dates through subtraction. The result is a difftime object that you can convert into a number if needed.
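The same idea exists outside R as well; for comparison, subtracting Python's datetime.date objects yields a timedelta, the analogue of R's difftime (this cross-language aside is mine, not part of the original recipe):

```python
from datetime import date

dt1 = date(2001, 1, 1)
dt2 = date(2001, 2, 1)

delta = dt2 - dt1                 # a timedelta, analogous to R's difftime
print(delta.days)                 # 31
print(dt1 + delta == dt2)         # True: adding the difference back
```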

Step 3 shows the logical comparison of dates, and step 4 shows two different ways of creating sequences of dates. In one, you specify the from date, the to date, and the fixed interval between the sequence elements as a string. In the other, you specify the from date, the interval, and the number of sequence elements you want. If you use the latter approach, you have to name the arguments.

Finally, Step 5 shows how you can create sequences by specifying the intervals in a flexible manner.

If you found this article interesting, you can explore Kuntal Ganguly’s R Data Analysis Cookbook – Second Edition to put your data analysis skills in R to practical use, with recipes catering to the basic as well as advanced data analysis tasks. This book has over 80 recipes to help you breeze through your data analysis projects using R.

Why Use Functions in Python?

This guest post by Fabrizio Romano, the author of Learn Python Programming – Second Edition, explores why functions are integral to developing applications in Python.

Functions are among the most important concepts and constructs of any language; here are a few reasons:

  • They reduce code duplication in a program. By having a specific task taken care of by a nice block of packaged code that you can import and call whenever you want, you don’t need to duplicate its implementation.
  • They help in splitting a complex task or procedure into smaller blocks, each of which becomes a function.
  • They hide the implementation details from their users.
  • They improve traceability.
  • They improve readability.

Let’s look at a few examples to get a better understanding of each point.

Reducing code duplication

Imagine that you are writing a piece of scientific software, and you need to calculate primes up to a limit. You have an algorithm to calculate them, so you copy-paste it to wherever you need. One day, though, your friend, B. Riemann, gives you a better algorithm to calculate primes, which will save you a lot of time. At this point, you need to go through your entire code base and replace the old code with the new one.

This is, in fact, a bad way of going about it. It’s error-prone as the process entails the risks of faulty code replacement and deletion of crucial code parts, making your algorithm inconsistent and unstable. What if, instead of replacing code with a better version of it, you need to fix a bug, and you miss one of the places? That would be even worse.

So, what should you do? Simple! You write a function, get_prime_numbers(upto), and use it anywhere you need a list of primes. When Riemann comes to you and gives you the new code, all you have to do is replace the body of that function with the new implementation, and you’re done! The rest of the software will automatically adapt, since it’s just calling the function.

Your code will be shorter, and it will not suffer from inconsistencies or undetected bugs due to copy-and-paste failures or oversights.
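To make this concrete, here is one plausible implementation of the get_prime_numbers(upto) function mentioned above (the sieve shown is just an illustrative choice; the whole point is that callers never need to know which algorithm is inside, so swapping in Riemann's faster version later touches only this body):

```python
def get_prime_numbers(upto):
    """Return the list of primes up to (and including) `upto`,
    using a simple sieve of Eratosthenes."""
    if upto < 2:
        return []
    is_prime = [True] * (upto + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(upto ** 0.5) + 1):
        if is_prime[n]:
            # Mark every multiple of n starting at n*n as composite
            for multiple in range(n * n, upto + 1, n):
                is_prime[multiple] = False
    return [n for n, prime in enumerate(is_prime) if prime]

print(get_prime_numbers(10))  # [2, 3, 5, 7]
```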

Splitting a complex task

Functions are also very useful for splitting long or complex tasks into smaller ones. The end result is that the code benefits from it in several ways, for example, readability, testability, and reuse.

To give you a simple example, imagine that you're preparing a report. Your code needs to fetch, parse, filter, and polish data, and then run a whole series of algorithms against the data to generate a report. It isn't uncommon to read procedures that are just one big do_report(data_source) function: tens or hundreds of lines of code that end with return report.

These situations are slightly more common in scientific code; they tend to be brilliant from an algorithmic point of view but lack the touch of experienced programmers when it comes to the style. Now, picture a few hundred lines of code. It’s very hard to follow through to find the places where things are changing context (such as finishing one task and starting the next one). Do you have the picture in your mind? Good. Don’t do it! Instead, look at this code:

def do_report(data_source):

    # fetch and prepare data

    data = fetch_data(data_source)

    parsed_data = parse_data(data)

    filtered_data = filter_data(parsed_data)

    polished_data = polish_data(filtered_data)

    # run algorithms on data

    final_data = analyse(polished_data)

    # create and return report

    report = Report(final_data)

    return report

The previous example is fictitious, of course, but can you see how easy it would be to go through the code? If the end result looks wrong, debugging each of the single data outputs in the do_report function will be extremely easy. Moreover, it’s even easier to exclude parts of the process temporarily from the whole procedure (you just need to comment out the parts you need to suspend).

Hiding implementation details

Refer to the preceding example to appreciate this merit as well. As you can see, by going through the code of the do_report function, you can get a pretty good understanding without reading one single line of implementation.

This is because functions hide the implementation details, meaning you don't need to delve into them if you don't want to, unlike the case where do_report is just one big, fat function. This reduces the time you spend reading code. Since reading code takes longer than actually writing it in a professional environment, it's very important to reduce reading time as much as you can.

Improving readability

Coders sometimes don’t see the point in writing a function with a body of one or two lines of code, so let’s look at an example that shows you why you should do it.

Imagine that you need to multiply two matrices:

Take a look at the following two code snippets and decide for yourselves which one is easier to read:


a = [[1, 2], [3, 4]]

b = [[5, 1], [2, 1]]

c = [[sum(i * j for i, j in zip(r, c)) for c in zip(*b)]

     for r in a]


# this function could also be defined in another module

def matrix_mul(a, b):

    return [[sum(i * j for i, j in zip(r, c)) for c in zip(*b)]

            for r in a]

a = [[1, 2], [3, 4]]

b = [[5, 1], [2, 1]]

c = matrix_mul(a, b)

In the second example, it’s much easier to understand that c is the result of the multiplication of a and b. Reading through the code is also easy and, if you don’t need to modify the multiplication logic, you don’t even need to go into the implementation details. On the other hand, in the first snippet, you would have to spend a lot of time trying to understand what that complicated list comprehension is doing.

Improving traceability

Imagine that you have written an e-commerce website. You have displayed the product prices all over the pages. Imagine that the prices in your database are stored with no VAT, but you want to display them on the website with VAT at 20%. Here are a few ways of calculating the VAT-inclusive price from the VAT-exclusive price:


price = 100  # GBP, no VAT

final_price1 = price * 1.2

final_price2 = price + price / 5.0

final_price3 = price * (100 + 20) / 100.0

final_price4 = price + price * 0.2

All these four ways of calculating a VAT-inclusive price are perfectly acceptable. Now, imagine that you have started selling your products in different countries and some of them have different VAT rates. This means you’ll have to refactor your code throughout the website in order to make the VAT calculation dynamic.

How would you trace all the places in which you are performing a VAT calculation? Coding is generally a collaborative task and you cannot be sure that the VAT has been calculated using only one of those forms, can you?

Here’s what you do. Write a function that takes the input values, vat and price (VAT-exclusive), and returns a VAT-inclusive price:


def calculate_price_with_vat(price, vat):

    return price * (100 + vat) / 100

Now you can import this function and use it throughout your website to calculate a VAT-inclusive price, and when you need to trace those calls, you can search for calculate_price_with_vat.
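As a quick sanity check, this single function reproduces the hard-coded calculations shown earlier, and changing the rate becomes a single argument rather than a code hunt:

```python
def calculate_price_with_vat(price, vat):
    return price * (100 + vat) / 100

price = 100  # GBP, no VAT

# Agrees with the earlier hard-coded formula price * 1.2 at 20% VAT
assert calculate_price_with_vat(price, 20) == price * 1.2
print(calculate_price_with_vat(price, 20))  # 120.0

# A different country's rate is just a different argument
print(calculate_price_with_vat(price, 25))  # 125.0
```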

Now that you’ve understood why functions are so important, explore Learn Python Programming – Second Edition to understand the nuances of Python programming to develop efficient, stable and high-quality applications. The book is replete with real-world examples that will make the fundamentals of Python programming a piece of cake and is hence a must-have for all beginners.

Two Simple Animation Techniques with R

Author: Omar Trejo Navarro

You know how people say “A picture is worth a thousand words”? Well, an animation can be worth even more! When trying to understand dynamics or illustrate concepts, animations can be very powerful. Lucky for us, R can be easily used to create them. Yes, that’s right! We don’t have to use any external software to create nice animations; we can simply use the tools we already know and love. Get ready to impress your friends and colleagues using them!

In this post, I'll show you two different ways to produce animations, with the animation and gganimate packages. As you will see, they require different ways of thinking about animations, which makes them naturally suited to different scenarios, but you can do most practical animations with either of them. If you want to dig deeper into the subject, I'd recommend looking into the great plotly package as well.

If you find this post interesting, you may want to look into my recent book "R Programming by Example", published by Packt. If you do, please share your opinion on the book's page. I really appreciate it, and every comment can go a long way. Also, if you have any questions or feedback, please don't hesitate to contact me through my website.

What will we create?

A common statistics programming exercise is to find whether the confidence intervals associated with different random samples contain the real value of the parameter used to generate them. In this post we will take on the most common case: random samples from a normal distribution with known variance and unknown mean. This unknown mean is a value we actually know, but we will pretend not to, in order to find out whether a confidence interval contains it.

As you may know, when creating confidence intervals, you must specify a degree of confidence, which implies the quantile of the distribution that will be used to generate the intervals, which in turn implies their width. This is intuitive: the more confident you want to be that a value is within an interval, the wider that interval should be. In this case we will use 95% confidence intervals, meaning that out of every 100 confidence intervals we generate, on average, 95 of them should contain the "real" mean value used to generate the random samples.

To make sure the code for the animations is clear, we will separate the generation of the random samples and their corresponding confidence intervals from the animations. In the animation, if a confidence interval contains the "real" mean, it will be gray. If it doesn't, it will be red to make it stand out. Below you can see an image of the animation once all the intervals have been drawn.

Note that the animation will be different when you run the code yourself because it depends on the parameters used as well as the samples generated, which are not being controlled in this example and will surely be different for you. If you want to get the same animations every time you run the code for an example such as this one, make sure you use set.seed(12345) (with an integer of your preference) before executing the code.

Generating our confidence intervals

We will treat a couple of variables as globals. SD, MEAN, N_S, and N_CI are global parameters for the standard deviation, mean, number of observations in each random sample, and number of confidence intervals (which has a one-to-one relation with the number of random samples), respectively.

SD <- 1
MEAN <- 0
N_S <- 100
N_CI <- 50

We define the confidence_interval() function to use the globals mentioned before to produce a normal random sample, using the rnorm() function, with N_S observations, mean MEAN, and standard deviation SD. Each confidence interval will be tested against the condition that it contains the MEAN value. Once we get the mean of our current random sample as m, we can compute the boundaries of its confidence interval by adding and subtracting the corresponding critical value, adjusted by the sample size.

If we did not assume we knew the standard deviation, we would have to resort to another formula that makes use of the Student’s t distribution instead of the normal distribution. Also, if we wanted 90% confidence intervals, we would use a 1.645 critical value instead; but you probably know all of that, so I’ll just focus on the code.
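Since the post’s code is in R, here is a quick cross-check of the same formula in plain Python (standard library only; the variable names mirroring the R globals are my own, not part of the post’s code): we build m ± 1.96·SD/√N_S intervals and verify that their empirical coverage lands near 95%.

```python
import math
import random

# Mirrors the R globals: known sigma, true mean, sample size
SD, MEAN, N_S = 1.0, 0.0, 100

random.seed(12345)  # fixed seed so the run is reproducible

def confidence_interval():
    """95% CI for the mean of one normal sample with known sigma."""
    sample = [random.gauss(MEAN, SD) for _ in range(N_S)]
    m = sum(sample) / N_S
    half_width = 1.96 * SD / math.sqrt(N_S)
    return m - half_width, m + half_width

# Empirical coverage over many intervals: it should be close to 0.95
trials = 2000
hits = sum(low <= MEAN <= high
           for low, high in (confidence_interval() for _ in range(trials)))
print(hits / trials)
```

With a 90% interval you would swap 1.96 for 1.645 and expect coverage near 0.90 instead.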

Some of you may not know why we need the three dots ( ... ) in the function signature. Those three dots are called an ellipsis, and they are there so that the function can accept an arbitrary number of arguments. We need them because we intend to use our function within an lapply() call, which sends a parameter by default (the current element in the iteration), and if we were not able to receive it as an argument to our confidence_interval() function, we would get an error. We could just as easily have used a name for such a parameter, but I think this is more elegant than having an explicit parameter that is unused and may confuse future readers/users of our code.
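R’s ellipsis has a close Python analogue, in case that helps the intuition: a function can swallow positional arguments it doesn’t need with *args, which lets map() hand it the current element without complaint. This is a sketch of mine for comparison, not part of the post’s R code:

```python
import random

def draw(*_ignored):
    """Accepts and discards any positional arguments, like R's ellipsis."""
    return random.random()

# map() passes each element of range(5) to draw(); draw() simply ignores it
values = list(map(draw, range(5)))
print(len(values))
```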

confidence_interval <- function(...) {
    m <- mean(rnorm(N_S, MEAN, SD))
    low <- m - 1.96 * SD / sqrt(N_S)
    high <- m + 1.96 * SD / sqrt(N_S)
    return(c(low, high))
}

The CIS object will contain a list of vectors, where each vector contains the boundaries of one confidence interval. We create it by simply running the confidence_interval() function N_CI times. Since our code is very simple and makes use of globals, we don’t need to pass it any parameters. Keep in mind that there is a parameter actually being passed to the confidence_interval() function, the current value of the 1:N_CI sequence, but it is caught by the ellipsis and not used.

CIS <- lapply(1:N_CI, confidence_interval)

To take a look at our confidence intervals, we can simply print the CIS object. In this case we see that the first confidence interval goes from about -0.12 to 0.25, the second from about -0.23 to 0.17, and so on. We are going to use these values to generate our animations in the following sections. These values correspond to the first three confidence intervals in the image shown above, and all of them contain the MEAN used, which was 0, so they appear in gray.

[1] -0.1217169 0.2502831
[1] -0.2287392 0.17326079
[1] -0.2311497 0.3208503

In the following sections we will see how to generate the actual animations that will show each confidence interval one by one.

Using the animation package

The first animation we generate will be with the animation package, which was developed by Yihui Xie, a Statistics PhD from Iowa State University. It works by capturing different frames of a graph. Before we show the code for our animation, here’s a simple example that creates 10 frames using a for loop, where each iteration plots a uniform random sample. As you can see, to generate animations with this package we create each frame separately, and the saveGIF() function, which receives an expression with our code inside, takes those images and stitches them together into an animation.
Given this, you may realize that if you want a smoother animation, you will probably want to create more frames and reduce the amount of movement between them, which is correct. Unfortunately, you may run into problems when increasing the number of frames, as I have. You’ll have to explore what fits your use case.

library(animation)

saveGIF({
    for (i in 1:10) plot(runif(10), ylim = 0:1)
})

Now that you understand the basic mechanics of how animations are generated with this package, let’s look at the code for our confidence intervals animation. The first thing to note is that we source the file we created in the previous section (animations-with-R-base.R) to make sure we have the CIS, MEAN, and N_CI objects available in memory. Next we load the animation package and specify that we want 0.25 seconds between frames, which is equivalent to 4 FPS (frames per second). Since we will create an animation with 50 confidence intervals, the total time for our animation will be 12.5 seconds.

We send our expression to the saveGIF() function, in which we use a for loop as before; for each confidence interval, we create an empty plot (note the 0 in the first parameter), with N_CI values on the x-axis and a y-axis range that lets us properly see the min/max values of our confidence intervals ([-0.6, 0.6] in this case).

We also provide some labels, as well as a horizontal gray line at the MEAN level, which is 0.
Next we extract the necessary values for each frame. These values are composed of all confidence interval values up to i. Since i can be thought of as the frame we are currently working with, this means we get the confidence intervals from the first up to the one for the current frame. This produces the effect of adding intervals one by one to the animation, but in reality we are getting the first i elements each time, not just adding the latest one. The actual extraction is done with a combination of the lapply() and unlist() functions, plus an anonymous function that simply returns the corresponding value from the pair for each confidence interval. This is done so that we have the boundary values for the confidence intervals in two separate vectors, which is required by the segments() function to actually draw the lines in the graph.

The only thing missing is the color. Here we apply an extraction mechanism similar to the one described in the previous paragraph, but instead of simply returning a value from a vector, we check whether all values are larger or smaller than the MEAN; in that case the confidence interval does not contain the MEAN value, and thus “red” is returned. Otherwise, “gray” is returned. The color object ends up being a vector with one element for each element in the x object (the other objects also contain the same number of values). Finally, we specify the line width to be a bit thicker with lwd = 3.

source("animations-with-R-base.R")
library(animation)

ani.options(interval = .25)

saveGIF({
    for (i in 1:N_CI) {
        plot(
            0,
            xlim = c(1, N_CI),
            ylim = c(-0.6, 0.6),
            xlab = "Sample",
            ylab = "Confidence Interval",
            main = "Confidence Interval Animation"
        )
        abline(h = MEAN, col = "gray", lwd = 2)
        x <- 1:i
        y1 <- unlist(lapply(CIS[1:i], function(ci) ci[[1]]))
        y2 <- unlist(lapply(CIS[1:i], function(ci) ci[[2]]))
        color <- unlist(lapply(CIS[1:i], function(ci) {
            if (all(ci > MEAN) || all(ci < MEAN)) {
                return("red")
            }
            return("gray")
        }))
        segments(x, y1, x, y2, col = color, lwd = 3)
    }
})

When we execute the previous code, we can see our animation, which is shown below. Pretty cool, isn’t it? You may want to look into the package’s examples to find more interesting use cases. You should also know that you can export HTML, SWF, and even LaTeX files if you so desire. Personally, I’ve found GIF portable enough to be used wherever I need it.

Using the gganimate package

Now we are going to create a similar animation, but this time with the gganimate package, which was developed by David Robinson, Chief Data Scientist at DataCamp, a Computational Biology PhD from Princeton, and a book author. This package was designed to work with ggplot2 and uses data frames as input. It extends the aes() function (aesthetics), used by the ggplot() function, to accept frame and cumulative parameters. Those parameters indicate which observations should be used in which frames, and whether the process should be cumulative. If this last parameter is set to TRUE, we don’t need to manually accumulate objects in the graph as we did with the animation package.

As you can see, we need to import both the ggplot2 and gganimate packages. Then we create a data frame with a structure similar to the separate vectors from the previous example. Naturally, we have one observation for each frame in our animation. Sometimes this will not come so naturally and you’ll need some data preparation, but this case is simple in that regard.

Next we create a graph as we normally would with the ggplot() function, but we specify two new parameters in the aes() function: frame and cumulative. The frame parameter lets gganimate know which variable indicates the data to be used for each frame, and cumulative produces the behavior mentioned above. Everything else is standard ggplot2 code. As before, we add a horizontal segment, as well as a segment for each confidence interval, we specify the colors, and remove the legend because it’s not useful in this case.

library(ggplot2)
library(gganimate)

data <- data.frame(
    x1 = 1:N_CI,
    x2 = 1:N_CI,
    y1 = unlist(lapply(CIS, function(ci) ci[[1]])),
    y2 = unlist(lapply(CIS, function(ci) ci[[2]])),
    color = unlist(lapply(CIS, function(ci) {
        if (all(ci > MEAN) || all(ci < MEAN)) {
            return("red")
        }
        return("gray")
    }))
)

plot <- ggplot(data, aes(color = color, frame = x1, cumulative = TRUE)) +
    geom_segment(
        aes_string(x = 0, xend = 50, y = 0, yend = 0), color = "gray"
    ) +
    geom_segment(
        aes(x = x1, xend = x2, y = y1, yend = y2), size = 1.5
    ) +
    scale_color_manual(values = c("gray", "red")) +
    theme(legend.position = "none")

If you want to see the final frame of the animation (where all the accumulation has been done), you can do so by simply printing the object (as you would with any ggplot2 graph). This is very helpful because you can check that the final result is what you expect without producing the animation.


To actually see the animation, you need to pass the plot object to the gganimate() function which will generate it and show it to you. The result is shown below.




In this post we have seen two different ways to create animations in R. They operate in slightly different ways, and each may end up being more intuitive in different situations. I prefer working with gganimate rather than with animation because it builds on tools I find very intuitive (ggplot2 and data frames), but also because it doesn’t require an explicit for loop to create each frame: you simply prepare the data and create a normal ggplot() graph, specifying which observations should be used in each frame. I find this approach very intuitive, but I know people who find the explicit looping approach more natural. Which one do you prefer? Let me know in the comments!
Now that you have the fundamentals for animations with R, imagine all the possibilities: all those times your colleagues were confused about what you were thinking, and you said to yourself, “If only I could show them this with an animation, it would be so much clearer.”

R Programming by Example

R is a high-level statistical language and is widely used among statisticians and data miners to develop analytical applications. Given the obvious advantages that R brings to the table, it has become imperative for data scientists to learn the essentials of the language.

R Programming by Example is a hands-on guide that helps you develop a strong fundamental base in R by taking you through a series of illustrations and examples. Written by Omar Trejo Navarro, the book starts with the basic concepts and gradually progresses towards more advanced concepts to give you a holistic view of R. Omar is a well-respected data consultant with expertise in applied mathematics and economics.

If you are an aspiring data science professional or statistician and want to learn more about R programming in a practical and engaging manner, R Programming by Example is the book for you!

Python online help

To consult the Python online help, we type the help() command to receive information on the use of the Python interpreter. After issuing the command, you receive a welcome message from the online help utility that, in case you are a beginner, invites you to see the tutorial available on the Internet:

>>> help()

Welcome to Python 3.6's help utility!

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules. To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, symbols, or topics, type
"modules", "keywords", "symbols", or "topics". Each module also comes
with a one-line summary of what it does; to list the modules whose name
or summary contain a given string such as "spam", type "modules spam".

In the online help, as already mentioned, just enter the name of any module, keyword, or topic to get help on writing Python programs. To exit the online help and return to the interpreter, simply type “quit”. To obtain a list of available modules, keywords, or topics, type “modules”, “keywords”, or “topics”.

Each module also comes with a one-line summary of what it does; to list the modules whose summary contains a given word, type modules followed by that word. For example, to get information on the array module, enter its name at the help prompt to obtain the information shown in the figure.


It is therefore advisable to consult the online help of the Python interpreter whenever you need to use a Python resource you do not know well; as we have seen, the Python interactive shell makes it easy and immediate to obtain adequate documentation.
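Besides the interactive utility, help() also accepts an object directly (for example, help(len)); the standard-library pydoc module returns the same documentation as a plain string, which can be handy outside the interactive shell. A small sketch:

```python
import pydoc

# pydoc.render_doc() produces the text that help(len) would page through
text = pydoc.render_doc(len)
print(text.splitlines()[0])
```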



Python Interactive Shell

To start Python, or more precisely to open the Python interactive shell, just click on the Start menu and find the shell icon among the frequently used programs, or click on All Programs and then on the ActiveState ActivePython 3.1 entry.

After activating the Python interactive shell, we find a window where we can type our Python instructions at the command line. Let’s start with the classic message that programmers send to a shell to test that it is working properly; I am referring to the most classic of messages: “Hello World”.

To display a message from the shell we need to print it, and the command that allows us to do this is, naturally, print; this confirms that reading Python code is much like reading an ordinary listing in English.

Then to display the message “Hello World” at the command prompt, just type the following statement:

print('Hello World')

to obtain the printout of the message as shown in the figure.


Having done this, let’s now see how to get a first, immediate bit of help from the Python interactive shell; when the shell opens, the following message is displayed:

Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.



How to define strings in Python

A string is identified through the use of quotation marks. In Python, both single and double quotes can be used to define a string: "string" or 'string'.

This double possibility allows us to include one type of quotation mark in a string enclosed by the other type, which is indispensable, for example, when using an apostrophe.

To better understand what has been said, let’s see an example:

>>> print("String containing the apostrophe: that's easy")
String containing the apostrophe: that's easy


Let’s see in detail how many ways it is possible to define a string:

  •    With single quotes: 'String in Python!'
  •    With double quotes: "String in Python!"
  •    With escape sequences: "Type the command \"ls\""
  •    Through conversion functions from other types of values: str(1111), str(11.11), str(11.1 + 1j)
  •    Finally, multiline strings, delimited by three quotes (single or double).
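The five ways listed above can be tried directly at the prompt; here they are collected in one runnable snippet:

```python
s1 = 'String in Python!'                      # single quotes
s2 = "String in Python!"                      # double quotes
s3 = "Type the command \"ls\""                # escape sequence for a quote
s4 = (str(1111), str(11.11), str(11.1 + 1j))  # conversion from other types
s5 = """A multiline
string"""                                     # triple quotes (single or double)

print(s1 == s2)    # the quoting style does not change the value
print(s3)
print("\n" in s5)  # the line break is part of the string
```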


Arithmetic operators in Python

In this post, let’s see how to perform simple calculations in the Python environment. Suppose we want to use the Python prompt as a simple calculator; we write:

>>> 6 + 5
11

Here we have computed an arithmetic operation: the sum of six and five. Python recognizes the numbers and the addition sign, adds them, and then shows the result.

All arithmetic operators can be used:

  • addition (+)
  • subtraction (-)
  • multiplication (*)
  • division (/)

We can then combine different operations to get multiple expressions:

>>> ((5 * 4) + (6 - 3)) / (1 + 4)
4.6

In the expression we have just seen, notice how Python uses parentheses when performing operations on numbers: they change the order in which the operations are carried out.


Let’s see what happens if you write the same sequence without the brackets:

>>> 5 * 4 + 6 - 3 / 1 + 4
27.0

As you can see, the result is quite different. This is because Python computes multiplications and divisions before additions and subtractions, following the standard rules of algebra.

These rules are adopted by all programming languages and determine the evaluation order of operations, known as operator precedence.
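The precedence rules above are easy to verify at the prompt; here are the two expressions from this section, side by side:

```python
# Parentheses force the grouping explicitly
with_parens = ((5 * 4) + (6 - 3)) / (1 + 4)

# Without them, * and / bind tighter than + and -,
# so this reads as (5 * 4) + 6 - (3 / 1) + 4
without_parens = 5 * 4 + 6 - 3 / 1 + 4

print(with_parens)     # 4.6
print(without_parens)  # 27.0
```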

Comments in Python

Commenting in Python is in some ways quite different from other languages, but it is easy to get used to this way of inserting explanatory text into your code.

In Python there are basically two ways to comment a program:

  •   single-line comments
  •   multi-line comments

A single-line comment is used to insert a short note (or for debugging), while multi-line comments are often used to describe something in much more detail.

Let’s look at some examples to better understand the concepts introduced so far. We start with a single-line comment:

print("This is not a comment")
#print("This is a comment")

When the interpreter encounters the # (hash) symbol, it ignores everything that follows it until the end of the line. We could also write:

print("This is not a comment") # Printing a text string

For the multi-line comment, we instead use triple quotes ('''); let’s see how:

'''
print("This is a comment")
print("Additional comment line")
'''
print("This is not a comment")

Comments are a useful resource for the programmer: they let us insert valuable explanatory text with reusability in mind, and they are also particularly effective in the debugging phase, where temporarily commenting out code helps identify possible bugs in our program.
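One detail worth adding: the triple-quoted “comment” is technically a string literal that the interpreter evaluates and discards, and when it opens a module, function, or class it is kept as the docstring. A small sketch:

```python
def example():
    '''This docstring documents the function.'''
    # The triple-quoted line above is a string, not a comment:
    # Python stores it in example.__doc__
    return 42

print(example())
print(example.__doc__)
```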