Originally published July 22, 2016 (GitHub: https://github.com/jabus/givR)
How do you audit the objects in your R data analysis pipeline?
Given a dataframe object and a series of piped operations, (how) can you observe the intermediate objects that result from each operation in the pipeline?
First, prepare your analysis environment by loading some useful libraries:
library(dplyr) # load power packages
library(magrittr)
library(tidyr)
Here’s an example pipeline:
iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  select(-Petal.Length, -Sepal.Length) %>%
  group_by(Species) %>%
  summarize(n()) -> iris2
print(iris2)
## Source: local data frame [3 x 2]
##
## Species n()
##
## 1 setosa 20
## 2 versicolor 16
## 3 virginica 35
You can inspect the dataframes that enter and exit the pipeline (using the print command, for example), but how can you observe the intermediate objects generated along the pipeline? This is the challenge I explore in this blog post.
Line by line
You can highlight and source (or copy and paste) each section of the pipeline and observe the resulting object. This is okay for development, but is ephemeral and not very useful in research operations.
Assignment statements
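You can use the assign() function to copy the intermediate objects into the global environment (pos=1) and inspect those copies later. Because assign() returns the value it assigns, each step passes the dataframe on unchanged, e.g.: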
iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  assign(x = "x1", value = ., pos = 1) %>%
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%
  assign(x = "x2", value = ., pos = 1) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  assign(x = "x3", value = ., pos = 1) %>%
  select(-Petal.Length, -Sepal.Length) %>%
  assign(x = "x4", value = ., pos = 1) %>%
  group_by(Species) %>%
  assign(x = "x5", value = ., pos = 1) %>%
  summarize(n()) -> iris2
print(head(x1))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
## 3 4.7 3.2 1.3 0.2 setosa 15.04
## 4 4.6 3.1 1.5 0.2 setosa 14.26
## 5 5.0 3.6 1.4 0.2 setosa 18.00
## 6 5.4 3.9 1.7 0.4 setosa 21.06
This works, although the alternating assignment lines along the pipeline feel repetitive.
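One way to cut the repetition is a small helper that stores the copy and hands the dataframe back to the pipeline. The sketch below is only an illustration: keep_as() and the snapshots environment are hypothetical names that do not appear in the original code, and the pipeline is shortened to two steps.

```r
library(dplyr)  # provides %>%, mutate() and filter()

# Hypothetical helper (an assumption, not from the original post): store a copy
# of `x` under `name` in `env`, then return `x` so the pipeline continues.
keep_as <- function(x, name, env = globalenv()) {
  assign(name, x, envir = env)
  x
}

# Collect the copies in their own environment instead of the global workspace
snapshots <- new.env()

iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  keep_as("x1", env = snapshots) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  keep_as("x2", env = snapshots) -> iris_big

print(head(snapshots$x1))
```

Keeping the copies in a dedicated environment avoids cluttering the workspace with x1, x2, and so on, but every other line of the pipeline is still bookkeeping.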
Building an audit trail
Here, I build on the above idea by copying and collecting the intermediate dataframes (objects) along the pipeline into an audit trail.
There are two steps to building an audit trail. First, you initiate the audit trail by instantiating an audit object with the start_audit() function. As written, the a_() function expects this object to be named audit and to live in the global environment.
Second, you wrap any (or all) piped operations in your pipeline with the a_() function. This takes a snapshot of the dataframe entering that piped operation and saves it to your audit trail.
After the pipe has completed, you can inspect the intermediate dataframes captured in your audit trail using the audit_trail() function.
Here are the function definitions for building and using an audit trail. (I may package these into an R package at some point in time.)
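# Function definitions
start_audit <- function() {  # create an audit environment with the node counter at zero
  audit <- new.env()
  audit$node <- 0
  return(audit)
}
a_ <- function(x, f, ...) {  # main functionality is captured with a_
  # relies on an environment named `audit` in the global environment
  audit$node <<- audit$node + 1
  audit[[as.character(audit$node)]] <- x  # snapshot the dataframe entering this step
  x %>% f(...)                            # apply the wrapped operation and return its result
}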
audit_trail <- function(a, ...) {
  if (missing(...)) {
    # default: every snapshot, i.e. all entries except the node counter
    node_labels <- seq_along(setdiff(ls(a), "node"))
  } else {
    node_labels <- list(...)
  }
  for (i in seq_along(node_labels)) {
    print(paste0("audit trail: ", node_labels[i]))
    print(head(a[[ as.character(node_labels[i]) ]] ))
  }
}
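Because a_() looks up an object named audit outside the function, the trail has to live under that exact name. A possible variant, sketched here as an assumption rather than as part of the original design, passes the audit environment explicitly. The name audited() and the dotted argument names are hypothetical; the leading dots reduce the chance that a column name supplied through ... is matched to one of the wrapper's own arguments.

```r
library(dplyr)  # provides %>%, mutate() and filter()

# Hypothetical variant of a_(): the audit environment is an explicit argument
audited <- function(.data, .audit, .f, ...) {
  .audit$node <- .audit$node + 1                  # environments are modified in place
  .audit[[as.character(.audit$node)]] <- .data    # snapshot the input to this step
  .f(.data, ...)                                  # apply the wrapped operation
}

# Any environment with a `node` counter can serve as a trail
trail <- new.env()
trail$node <- 0

iris %>%
  audited(trail, mutate, Sepal.Area = Sepal.Width * Sepal.Length) %>%
  audited(trail, filter, Sepal.Area > mean(Sepal.Area)) -> result

print(trail$node)          # number of snapshots taken
print(head(trail[["2"]]))  # input to the filter step
```

The cost is one extra argument per step; the benefit is that two pipelines can keep separate trails without overwriting each other.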
How you use the audit trail
Below, I illustrate how to build and inspect an audit trail for the example pipeline.
### Employ an auditing object with the pipeline
start_audit() -> audit #Initialize the audit object (set node to zero)
iris %>%
  a_(mutate, Sepal.Area = Sepal.Width * Sepal.Length) %>%
  a_(mutate, Petal.Area = Petal.Width * Petal.Length) %>%
  a_(filter, Sepal.Area > mean(Sepal.Area)) %>%
  a_(select, -Petal.Length, -Sepal.Length) %>%
  a_(group_by, Species) %>%
  a_(summarize, n()) -> iris2
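Note that a_(...) wraps each unit operation in the pipeline: the first argument you supply is the operation's function (for example, mutate), followed by that function's own arguments. The piped dataframe is passed in implicitly as x.
At this point, we can inspect the objects audited in the pipeline. Each snapshot records the input to a step, so node 1 is the original iris data and node 6 is the grouped dataframe that enters summarize; the final result is held in iris2: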
audit %>% audit_trail #reveal all objects
## [1] "audit trail: 1"
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## [1] "audit trail: 2"
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
## 3 4.7 3.2 1.3 0.2 setosa 15.04
## 4 4.6 3.1 1.5 0.2 setosa 14.26
## 5 5.0 3.6 1.4 0.2 setosa 18.00
## 6 5.4 3.9 1.7 0.4 setosa 21.06
## [1] "audit trail: 3"
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
## 3 4.7 3.2 1.3 0.2 setosa 15.04
## 4 4.6 3.1 1.5 0.2 setosa 14.26
## 5 5.0 3.6 1.4 0.2 setosa 18.00
## 6 5.4 3.9 1.7 0.4 setosa 21.06
## Petal.Area
## 1 0.28
## 2 0.28
## 3 0.26
## 4 0.30
## 5 0.28
## 6 0.68
## [1] "audit trail: 4"
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 5.0 3.6 1.4 0.2 setosa 18.00
## 3 5.4 3.9 1.7 0.4 setosa 21.06
## 4 5.4 3.7 1.5 0.2 setosa 19.98
## 5 5.8 4.0 1.2 0.2 setosa 23.20
## 6 5.7 4.4 1.5 0.4 setosa 25.08
## Petal.Area
## 1 0.28
## 2 0.28
## 3 0.68
## 4 0.30
## 5 0.24
## 6 0.60
## [1] "audit trail: 5"
## Sepal.Width Petal.Width Species Sepal.Area Petal.Area
## 1 3.5 0.2 setosa 17.85 0.28
## 2 3.6 0.2 setosa 18.00 0.28
## 3 3.9 0.4 setosa 21.06 0.68
## 4 3.7 0.2 setosa 19.98 0.30
## 5 4.0 0.2 setosa 23.20 0.24
## 6 4.4 0.4 setosa 25.08 0.60
## [1] "audit trail: 6"
## Source: local data frame [6 x 5]
## Groups: Species [1]
##
## Sepal.Width Petal.Width Species Sepal.Area Petal.Area
##
## 1 3.5 0.2 setosa 17.85 0.28
## 2 3.6 0.2 setosa 18.00 0.28
## 3 3.9 0.4 setosa 21.06 0.68
## 4 3.7 0.2 setosa 19.98 0.30
## 5 4.0 0.2 setosa 23.20 0.24
## 6 4.4 0.4 setosa 25.08 0.60
audit %>% audit_trail(2, 3) # reveal the objects at the 2nd and 3rd nodes
## [1] "audit trail: 2"
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
## 3 4.7 3.2 1.3 0.2 setosa 15.04
## 4 4.6 3.1 1.5 0.2 setosa 14.26
## 5 5.0 3.6 1.4 0.2 setosa 18.00
## 6 5.4 3.9 1.7 0.4 setosa 21.06
## [1] "audit trail: 3"
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
## 3 4.7 3.2 1.3 0.2 setosa 15.04
## 4 4.6 3.1 1.5 0.2 setosa 14.26
## 5 5.0 3.6 1.4 0.2 setosa 18.00
## 6 5.4 3.9 1.7 0.4 setosa 21.06
## Petal.Area
## 1 0.28
## 2 0.28
## 3 0.26
## 4 0.30
## 5 0.28
## 6 0.68
N.B. I’ve written the audit_trail function to display only the head (the first six rows) of each object.
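Since the audit object is an ordinary environment, the complete snapshots remain available by name, and a compact summary of their dimensions can be easier to scan than a series of printed heads. The sketch below fills a small audit environment by hand, using base R's transform() in place of mutate(), so that it runs on its own; the shape data frame and the other variable names are illustrative assumptions.

```r
# Rebuild a two-node audit environment by hand so the sketch is self-contained;
# in the post this object would come from start_audit() and a_().
audit <- new.env()
audit$node <- 2
audit[["1"]] <- iris
audit[["2"]] <- transform(iris, Sepal.Area = Sepal.Width * Sepal.Length)

# Snapshot names are every entry except the counter, sorted numerically
nodes <- setdiff(ls(audit), "node")
nodes <- nodes[order(as.integer(nodes))]

# One row per node, giving the dimensions of the stored object
shape <- data.frame(
  node = nodes,
  rows = vapply(nodes, function(n) nrow(audit[[n]]), integer(1), USE.NAMES = FALSE),
  cols = vapply(nodes, function(n) ncol(audit[[n]]), integer(1), USE.NAMES = FALSE)
)
print(shape)

# The complete dataframe at node 2, not just its head
full_step2 <- audit[["2"]]
```

A table like this makes it easy to spot the step where rows are filtered out or columns are dropped.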
Next steps
- Solicit feedback from the R community on this idea.
Revised and republished 2017-09-29.