Computers aren't the thing. They're the thing that gets you to the thing.

Originally published July 22, 2016 (github -> https://github.com/jabus/givR)

How do you audit the objects in your R data analysis pipeline?

Given a dataframe object and a series of piped operations, (how) can you observe the intermediate objects that result from each operation in the pipeline?

First, prepare your analysis environment by loading some useful libraries:

library(dplyr) # load power packages
library(magrittr)
library(tidyr)

Here’s an example pipeline:

iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  select(-Petal.Length, -Sepal.Length) %>%
  group_by(Species) %>%
  summarize(n()) -> iris2

print(iris2)
## Source: local data frame [3 x 2]
## 
##      Species   n()
##       <fctr> <int>
## 1     setosa    20
## 2 versicolor    16
## 3  virginica    35

You can inspect the dataframes that enter and exit the pipeline (using the print command, for example), but how can you observe the intermediate objects generated along the pipeline? This is the challenge I explore in this blog post.

Line by line

You can highlight and source (or copy and paste) each section of the pipeline and observe the resulting object. This is okay for development, but is ephemeral and not very useful in research operations.

Assignment statements

You can use the assign() function to copy the intermediate objects into your workspace, and inspect those copies later, e.g.:

iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  assign(x="x1",value=., pos=1) %>%
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%
  assign(x="x2",value=., pos=1) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  assign(x="x3",value=., pos=1) %>%
  select(-Petal.Length, -Sepal.Length) %>%
  assign(x="x4",value=., pos=1) %>%
  group_by(Species) %>%
  assign(x="x5",value=., pos=1) %>%
  summarize(n()) -> iris2


print(head(x1))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06

This works, although the assignment lines alternating with every operation in the pipeline feel a bit repetitive.
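If you only want to look at the intermediate objects rather than keep them, magrittr's tee operator %T>% is another option. It is not part of the approach in this post; the sketch below assumes magrittr is attached (dplyr does not re-export %T>%) and uses a hypothetical result name, iris_tee. The tee operator runs its right-hand side for the side effect only and passes the left-hand side on unchanged.

```r
library(dplyr)
library(magrittr)  # provides %T>%

iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %T>%
  { print(dim(.)) } %>%                      # side effect only; the data frame passes through
  filter(Sepal.Area > mean(Sepal.Area)) %T>%
  { print(dim(.)) } %>%                      # dimensions after filtering
  group_by(Species) %>%
  summarize(n = n()) -> iris_tee             # iris_tee is a hypothetical name
```

Like the line-by-line approach, this leaves nothing behind to inspect after the pipeline has run, so it suits development more than research operations.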

Building an audit trail

Here, I build on the above idea by copying the intermediate dataframes (objects) along the pipeline and collecting them into an audit trail.

There are two steps to building an audit trail. First, you initiate the audit trail by instantiating an audit object with the start_audit() function.

Second, you wrap any (all) piped operations in your pipeline with the a_() function. This takes a snapshot of the dataframe entering that piped operation and saves it to your audit trail.

After the pipe has completed, you can inspect the intermediate dataframes captured in your audit trail using the audit_trail() function.

Here are the function definitions for building and using an audit trail. (I may package these into an R package at some point in time.)

#function definitions

start_audit <- function() { #initialize (reset) the node counter to zero
  audit <- new.env()
  audit$node <- 0
  return(audit)
}

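a_ <- function(x, f, ...) { #main functionality is captured with a_
  #note: relies on an audit object named `audit` in the calling workspace
  audit$node <<- audit$node + 1            #advance the node counter
  audit[[as.character(audit$node)]] <- x   #snapshot the object entering this step
  x %>% f(...)                             #apply the wrapped operation and return its result
}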

audit_trail <- function(a, ...) {
  if (missing(...)) {
    #default: every snapshot, i.e. every entry except the "node" counter
    node_labels <- seq_along(setdiff(ls(a), "node"))
  } else {
    node_labels <- c(...)
  }
  for (label in node_labels) {
    print(paste0("audit trail: ", label))
    print(head(a[[as.character(label)]]))
  }
}
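The a_() function above finds the audit object by name, so it only works when that object is called audit and lives in the workspace. A minimal sketch of one alternative is to pass the audit environment as an explicit argument. The names audit_step and trail are hypothetical and not part of the original functions; the sketch relies on the fact that R environments are modified in place.

```r
library(dplyr)

# Hypothetical variant of a_(): the audit environment is an explicit argument
# instead of a workspace variable that must be called `audit`.
audit_step <- function(x, trail, f, ...) {
  trail$node <- trail$node + 1                        # environments are modified in place
  assign(as.character(trail$node), x, envir = trail)  # snapshot the object entering this step
  f(x, ...)                                           # apply the wrapped operation
}

# Same structure that start_audit() creates: an environment with a node counter
trail <- new.env()
trail$node <- 0

result <- data.frame(x = 1:4) %>%
  audit_step(trail, mutate, y = x * 2) %>%
  audit_step(trail, filter, y > 4)

print(trail$node)       # number of snapshots taken
print(trail[["2"]])     # the data frame that entered the filter step
```

The cost is one extra argument per step; the benefit is that several audit trails can coexist and the function no longer depends on a particular variable name.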

How you use the audit trail

Below, I illustrate how to build and inspect an audit trail for the example pipeline.

### Employ an auditing object with the pipeline

start_audit() -> audit #Initialize the audit object (set node to zero)

iris %>%
  a_(mutate, Sepal.Area = Sepal.Width * Sepal.Length) %>%
  a_(mutate, Petal.Area = Petal.Width * Petal.Length) %>%
  a_(filter, Sepal.Area > mean(Sepal.Area)) %>%
  a_(select, -Petal.Length, -Sepal.Length) %>%
  a_(group_by, Species) %>%
  a_(summarize, n() ) -> iris2

Note that a_() wraps each operation in the pipeline: its first argument is the function for that operation, and the remaining arguments are passed through to that function.

At this point, we can inspect the objects audited in the pipeline:

audit %>% audit_trail #reveal all objects
## [1] "audit trail: 1"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## [1] "audit trail: 2"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
## [1] "audit trail: 3"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
##   Petal.Area
## 1       0.28
## 2       0.28
## 3       0.26
## 4       0.30
## 5       0.28
## 6       0.68
## [1] "audit trail: 4"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          5.0         3.6          1.4         0.2  setosa      18.00
## 3          5.4         3.9          1.7         0.4  setosa      21.06
## 4          5.4         3.7          1.5         0.2  setosa      19.98
## 5          5.8         4.0          1.2         0.2  setosa      23.20
## 6          5.7         4.4          1.5         0.4  setosa      25.08
##   Petal.Area
## 1       0.28
## 2       0.28
## 3       0.68
## 4       0.30
## 5       0.24
## 6       0.60
## [1] "audit trail: 5"
##   Sepal.Width Petal.Width Species Sepal.Area Petal.Area
## 1         3.5         0.2  setosa      17.85       0.28
## 2         3.6         0.2  setosa      18.00       0.28
## 3         3.9         0.4  setosa      21.06       0.68
## 4         3.7         0.2  setosa      19.98       0.30
## 5         4.0         0.2  setosa      23.20       0.24
## 6         4.4         0.4  setosa      25.08       0.60
## [1] "audit trail: 6"
## Source: local data frame [6 x 5]
## Groups: Species [1]
## 
##   Sepal.Width Petal.Width Species Sepal.Area Petal.Area
##         <dbl>       <dbl>  <fctr>      <dbl>      <dbl>
## 1         3.5         0.2  setosa      17.85       0.28
## 2         3.6         0.2  setosa      18.00       0.28
## 3         3.9         0.4  setosa      21.06       0.68
## 4         3.7         0.2  setosa      19.98       0.30
## 5         4.0         0.2  setosa      23.20       0.24
## 6         4.4         0.4  setosa      25.08       0.60

audit %>% audit_trail(2, 3) #reveal the objects at the 2nd and 3rd nodes
## [1] "audit trail: 2"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
## [1] "audit trail: 3"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
##   Petal.Area
## 1       0.28
## 2       0.28
## 3       0.26
## 4       0.30
## 5       0.28
## 6       0.68

N.B. I’ve written the audit_trail function to display only the head (the first six rows) of each object.
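Because the audit object is an ordinary R environment, the full snapshots stay available by node label even though audit_trail() prints only their heads. The sketch below is self-contained, so it builds a small stand-in environment with hypothetical contents rather than reusing the pipeline above; the names nodes and row_counts are also just for illustration. It tabulates the number of rows at each node, which is a quick way to see where a pipeline drops observations.

```r
library(dplyr)

# Stand-in for an audit environment: entries "1", "2", ... hold the snapshots,
# and "node" holds the counter (contents here are hypothetical).
audit <- new.env()
audit$node <- 2
audit[["1"]] <- iris
audit[["2"]] <- iris %>% filter(Sepal.Length > 5)

# Row counts at every node, in pipeline order
nodes <- as.character(seq_len(audit$node))
row_counts <- vapply(nodes, function(k) nrow(audit[[k]]), integer(1))
print(row_counts)

# A full snapshot can be pulled out directly, e.g. for summary()
full_snapshot <- audit[["2"]]
```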

Next steps

  • Solicit feedback from the R community on this idea.

Revised and republished 2017-09-29.

jabustyerman@gmail.com