
Computers aren't the thing. They're the thing that gets you to the thing.


There are many ways that Data Science could add value. In theory.

In practice, I’ve found that there is one way for Data Science to add value, and that way is in supporting a decision making process.

Note that supporting a decision making process is not the same as “making the decision” — ultimately this must rest with a human that is accountable for the decision. This may mean that if you use software to “make decisions,” you are still responsible and accountable as you decided to employ the software as your proxy.

The real opportunity arises when a decision involves uncertainty (or ambiguity) and some “data science” — read: a system and its associated analyses — can be applied to 1) quantify the uncertainty, and 2) reduce the uncertainty around the decision. Using information to reduce uncertainty adds value that can itself be quantified.

What else does Data Science do?

  • Is it a workflow, a pipeline, or a database architecture? Maybe in part, but only if it facilitates decision making.
  • Knowledge management? Maybe in part, but only if it enables decisions based on that knowledge.
  • Improved understanding? Perhaps, but how do you translate that improved understanding into value? Only through your subsequent actions, i.e., decisions.

In the end, whether you invest in Data Science should really depend on whether you have decisions where you want to reduce the uncertainty associated with different actions or choices.

When taking on a new role, you’ll inevitably have new tasks and responsibilities. Some of those tasks will be repeated, and at some point you might wonder whether automating the task would be more efficient than continuing to do the task manually.

For me, the threshold for automation typically happens after I’ve done a task just once. And knowing I may have to repeat a task actually influences how I undertake the task the first time. That is, while doing the task, I’m asking myself how I might change what I need to do right now, based on the likelihood that I would have to do the task 10 more times, or 1,000 more times.

I’ve worked in biotechnology companies where we would typically run an experiment, and after analyzing the results, decide to re-run the experiment —  perhaps with a minor tweak to the protocol, or using a different set of subjects or experimental conditions. In some cases, the insights derived from the experiments pushed us to make the experiment operational, that is, a regular and repeated part of our workflow. This meant that we would periodically run the experiment once a week, or once a month, to generate new data and glean new insights.

Whether the actual experiment could or should be automated would depend on the nature of the experiment.

However, analyzing the data from the experiment was almost always a candidate for automation, and if I knew I would have to refresh the data and repeat the analysis, I would write R scripts to automate the data analysis and report generation. I wanted to automate the parts of the workflow that took time, required attention to detail and precision, were well defined, and were often mundane, boring, and error prone.

Once achieved, the automation gives one the time to think about the message in the data, and the time to translate the insights from the data into value for the organization.

Once you’ve decided that a task — like data analysis — should be automated, you are faced with deciding how to go about automating it. The answer may depend on whether you and your colleagues have the skills to do the automation yourselves, the capital to hire others to do it, or the budget to purchase a drop-in solution (e.g., software) that enables automation.

As a grad student, I found that the balance of time and money favored learning the tools (in this case statistics, R, python, and SQL) and doing the automation myself.

However, in a business environment, learning these tools may not be the best option. Learning them takes a lot of time and effort, and that equates to an opportunity cost: your time and energy may be better spent applying your expertise and thinking about your problem domain. Remember that these tools ultimately result in a system that automates the task in question. This system can involve software (with source code), databases (with supporting architecture), web services (with access and security), and the time to address new issues (changes in requirements) as they arise in your process. With these considerations in mind, it may make sense to hire dedicated staff (if the process and system are large enough to warrant this choice), or to purchase or subscribe to a software system that meets your automation needs.


Having said that, here’s a shameless plug for a new service we’re offering:

Yukon Data Solutions is our data analytics and report generation service. We work with biotechnology and life science teams to automate their data analytics pipelines. We focus on the rapid development and deployment of lightweight, customized automation solutions. Our goal is to take care of the painful parts of data analysis and report generation, so you can focus on the more important task of thinking about what it all means. Drop us an email if you want to determine whether Yukon Data Solutions might be a good fit for your needs. One last thing: our service is 100% satisfaction guaranteed.


Two modes of work

I want to contrast two modes of work in R&D environments: Engineering mode and Discovery mode. These modes often share the same rules of logic, rely on similar concepts and frameworks, leverage the same scientific tools and technologies, and use the same language (and jargon). They often occur in similar environments, and even in the same person (albeit usually at different times). In education, the two are lumped together: the Science and Engineering of STEM roughly map to the Discovery and Engineering modes I’m describing here.

A critical and defining difference between Engineering and Discovery modes, however, is revealed in the distribution of outcomes (or “typical results” and “surprises”) associated with these work modes. Others have argued that the defining difference lies in the flow of information (Drexler, see Review here), but I will focus on outcomes in this post.

For work conducted in Engineering mode, most outcomes (i.e., typical results) look like success. Software engineers produce software that works 99% of the time. Bridge and aeronautical engineers produce bridges and airplanes that work 99.999999% of the time. My numbers might be off, but the idea is that typical results are successful results, and moreover, typical results are often close to perfect.

For Discovery mode, the expectation of outcomes is the opposite. Most outcomes look like failure. Visualize gold mines and oil field explorations and drug discovery — where success happens 1%, 0.01% or 0.00001% (or less!) of the time. My point is that typical outcomes are failures and misses.

The asymmetry in typical outcomes between Engineering and Discovery mode activities is illustrated below.

 


The histograms illustrate the outcomes (“Payoffs”) for the Discovery (top panel) and Engineering (bottom panel) modes respectively. Payoffs can be negative (red) or positive (black). Surprises (extreme and rare events) are either pleasant (Discovery) or catastrophic (Engineering). Where do you want to operate?

 

For Engineering mode (bottom panel), most outcomes are positive, but not that positive. When surprising events occur in Engineering mode, they are almost always surprising in a bad way and can be disastrous. Trains, planes and automobiles mostly run on time, except when they don’t, and then they are almost always late.

For Discovery mode (top panel), most outcomes are negative, but not that negative. When your latest batch of compounds fails to have an effect on a cancer cell line, you might be disappointed, but you probably aren’t surprised. When surprising events do occur in Discovery mode, however, they are almost always surprising in a good way (jackpot!). The stream of misses and failures is explained as the cost of doing business.

An Antifragile aside

 


Discovery mode is where it’s at

The asymmetry between Engineering and Discovery modes maps to Nassim Taleb’s Sucker and Non-sucker categories, as described in his books The Black Swan and Antifragile. Here, Engineering is the Sucker and considered fragile, because Engineering suffers from “tail events” (and from increases in volatility), while Discovery is the Non-sucker and considered antifragile, because Discovery benefits from tail events (and from increases in volatility). Note that I’m referring to a mode of work: with respect to being an employee (see below), Taleb has argued that employment (dependence on a paycheck from others) is a fragile endeavor for the employee.

Back to our story…

In larger companies, employees often take on roles that are predominantly one mode or the other. Scientists and R&D engineers may spend most of their time trying to discover new insights or develop new technology, and thus work largely in Discovery mode. Engineers, operators, and service staff who run well-developed systems, in contrast, operate for the most part in Engineering mode.

In smaller companies (or on individual projects) it isn’t atypical for an individual to transition between Engineering and Discovery modes in the course of working on a project.

To illustrate this point: I recently worked on a data science project that required both modes of operation. The first step was to formulate a question and hypothesis: this was the work of a (Data) Scientist and fit neatly into Discovery mode. The next step was to design and build a system for capturing and storing data to explore and test the hypothesis. This involved programming a web scraper and a database for storing data, and because this step relied on my deployment of well-understood programming technologies (where I had high certainty, >90%, that I could accomplish the task), it fit neatly into Engineering mode. This step culminated in the successful construction and operation of a system to capture and store the data needed to answer (or at least explore) the question of interest. Next, with the data in hand, I shifted back to Discovery mode and began to explore the data, looking for patterns and insights. Here, although I had some ideas about what I might (and hoped to!) find, I had no certainty in any particular outcome. The results of this step (thus far) were my findings: insights and ideas about the way the world works and, more importantly, new questions for the next steps of inquiry.

For this project, the Engineering and Discovery modes of work were both necessary. Was I successful? It depends: I am 100% certain that I built a system for capturing and storing the data to answer the question (The Engineering mode activities). The jury is still out about whether the findings from my Discovery mode activities were interesting (or important).

Where do you want to operate?

With respect to your job, occupation, or career, and if you have a choice, where do you want to spend your time, and on what sorts of activities: Discovery or Engineering?

If you are an employee at an R&D company, my answer is “it depends.” As an employee, you likely have a manager. And if your manager is a good manager, she will recognize that there are differences between Discovery and Engineering modes, activities, and (most importantly) outcomes, and will set expectations, incentives, and compensation to reflect those modes accordingly.

If, however, your manager is blind to the asymmetry in outcome distributions I’ve described above, then the last thing you want is to be misclassified by him and subjected to the expectations (KPIs) and reward structure (incentives and compensation) of the alternate mode. This is especially bad when Discovery mode activities (and failures) are judged against the expectations a manager might have for Engineering mode activities (and successes). (And this is why it is important to distinguish between Data Engineering and Data Science.)

The economic environment under which a company operates can also affect where you want to sit on the Discovery-Engineering spectrum. In boom times, when there is money in the bank and excitement from early wins, it’s great to operate in Discovery mode. If your role, however, is Engineering-focused, then you might grow resentful when others are lauded for their amazing discoveries. My recommendation is to find Discovery mode opportunities in your otherwise Engineering- (or Operations-) filled world of activities.

If the economic environment erodes, however, and you typically operate in Discovery mode, look out! When budgets shrink, deadlines tighten, and the easy early wins give way to missed goals (or losses), humans become fearful and conservative. Managers may increasingly incentivize employees to act conservatively, to reduce the riskiness of their activities (and, implicitly, the volatility of their results). For the Discovery mode employee, this reduction in volatility is disastrous, because it caps the upside and prevents them from realizing any truly extreme (and beneficial) payoffs. If you find yourself in this position, my recommendation is to look for Engineering mode activities (temporarily) or to seek out new employment opportunities.

 

Variation matters

Engineering mode shuns uncertainty, because uncertainty may involve risk that corresponds to bad surprises. Discovery mode thrives under uncertainty, especially when a rare but beneficial result leads to finding something new, or to a reduction of uncertainty in the face of strategic decisions.

In summary, to understand the distinction between Discovery and Engineering modes, one needs an appreciation for variation and for the distribution of outcomes expected while operating in each mode. Without understanding the asymmetry in their outcome distributions, it is difficult to convey how these modes of work differ.

By Jabus Tyerman (email: jabus@delvbio.com)

A post by Mark Perry about U-Haul truck rental prices suggested that so many people were moving away from San Jose, CA that it was causing shortages of U-Haul rental trucks in San Jose. In response to high demand for rental trucks, U-Haul was adjusting one-way truck rental prices for leaving San Jose to be many multiples of the prices to move from the same cities to San Jose. For example, Perry reported that the price to rent a truck to move from San Jose to Las Vegas, NV was 16x more than the price to move from Las Vegas to San Jose.

Perry suggested that U-Haul would use dynamic pricing to optimize one-way truck rental prices in response to local supply and demand, and that we could use the ratio of Outbound moving prices compared to Inbound moving prices to infer net movement of people between pairs of cities. I wondered if we could use this idea to get a real-time measurement of movement patterns (“migration”) among other US cities?

Because Perry only reported on price imbalances between San Jose and six other cities, I wanted to start by looking more broadly at pricing imbalances across the US. I collected U-Haul pricing data for the 100 largest US cities (by population), with the intention of ranking cities for Outflow (and Inflow), based on average imbalance in truck rental prices.

I paired each focal city (“A”) with every other city (“B”), and used the U-Haul website to collect one-way pricing quotes to rent a 10′ truck to move from A to B (Outbound), and from B to A (Inbound). I calculated the log(Outbound/Inbound) for each city pair, and then used the average result of all city pairs (for each focal city) to generate an index of migration for that city. I called this index the U-Haul Moving Index, or UMI.

For example, using San Jose, CA as the focal city (A): the Outbound and Inbound one-way truck rental prices between San Jose and Tucson, AZ (B) were $1271 and $161 respectively. The Outbound/Inbound ratio is $1271/$161 = 7.9, and taking the log of 7.9 yields 2.06. This calculation was repeated with San Jose as A and every other city as B, and the median of these values (the UMI) for San Jose was 0.69.

[I converted the ratio of Outbound/Inbound to a log scale because taking the natural log of a ratio transforms the asymmetric linear scale into a symmetrical log scale. In this way, an Outbound price that is 2x the Inbound price will have the same scaling as an Outbound price that is 0.5x the Inbound price, i.e. a factor of 2 in both cases.]
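To make the calculation concrete, here is a minimal R sketch of the UMI for a single focal city. The column names (city_b, outbound, inbound) and all prices other than the Tucson pair quoted above are made up for illustration (this is not the actual dataset), and I take the median of the log ratios, as in the San Jose example.

library(dplyr)

# Hypothetical one-way quotes for a single focal city A; only the Tucson prices
# come from the example above, the rest are invented for illustration.
quotes <- tibble(
  city_b   = c("Tucson, AZ", "City X", "City Y"),
  outbound = c(1271, 900, 450),   # price to move A -> B
  inbound  = c(161,  450, 900)    # price to move B -> A
)

quotes %>%
  mutate(log_ratio = log(outbound / inbound)) %>%  # symmetric around zero
  summarize(UMI = median(log_ratio))               # one UMI value for city A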

Findings

The main results are illustrated in Figures 1 and 2 below.

Figure 1 ranks cities according to the U-Haul Moving Index (UMI). Cities having positive UMI (purple) are cities where Outbound prices are greater than Inbound prices, suggesting that trucks are in short supply, due to a net outflow of people. Cities having negative UMI (orange) are cities where Outbound prices are less than Inbound prices, due to a net inflow of people.


Figure 1. Cities ranked by Outflow (and Inflow) using the U-Haul Moving Index (UMI). Positive UMI (purple) cities are cities where Outbound prices are greater than Inbound prices (on average), suggesting trucks are in short supply relative to demand, due to a net outflow of people. Negative UMI (orange) cities are cities where Outbound prices are less than Inbound prices (on average), suggesting trucks are in surplus relative to demand, due to a net inflow of people.

(Click here for a high resolution .pdf of Figure 1)

Figure 2 is a map of 100 cities, colored by U-Haul Moving Index (UMI), and highlights strong regional patterns in these data.

Regions having positive UMIs — dominated by people leaving the region — include California, Chicago (and the surrounding Lake Michigan states), New York City (and the northeastern seaboard states), and Miami, FL.

Regions having negative UMIs — dominated by people arriving in the region — include the southeastern states, Texas and Oklahoma, Arizona, and Boise, ID.


Figure 2. Map of US cities colored by U-Haul Moving Index (size reflects population).

How do these data compare to other studies?

Previously, U-Haul has analyzed its own data to report on migration trends. U-Haul used the total number of one-way arrivals in 2017 to rank cities as US destinations, and found that Houston, TX and Chicago, IL were the top two destination cities. In contrast, the UMI (this analysis) ranked Houston #42 of 100 inflow cities, and Chicago as the top outflow city after the California cities (ranked 15/100). Additionally, U-Haul’s total one-way arrival method ranked San Jose, CA as #42 in its list of top 50 destination cities, while this study ranks San Jose, CA as tied for the top outflow city (#1 of 100). These differences in ranking may reflect differences in methodology. However, because the data in these studies came from different time periods (2017 for the one-way arrival data, and June 2018 for the UMI data), it is also conceivable that the differences in city rankings reflect underlying differences in migration patterns rather than methodology. Whether one method describes migration patterns more accurately has not been determined. In my opinion, however, the total one-way arrivals method used by U-Haul ignores the number of one-way departures, and may therefore not provide the full accounting of inflows and outflows required to calculate the net flow of people to and from a city. Rental pricing — if dynamically optimized in response to local supply and demand — could better integrate information about both inflows and outflows of people.

The company Redfin has used house search data to estimate the movement of people, and ranked San Francisco, New York and LA as the top outflow cities (with Chicago at #5) — a result more in line with the UMI rankings in this study. However, Redfin also identified Sacramento, Phoenix and Las Vegas as top destination cities, while the UMI in this study strongly ranked Sacramento as an outflow city and Las Vegas as having a more balanced in- and outflow of people. (Phoenix had a moderately negative UMI, suggesting it was a moderate inflow city.) As with the U-Haul total one-way arrival method, it is unknown whether the differences in city rankings between Redfin and this study stem from differences in methodology or from differences in moving patterns, since the data were captured during different time periods.

One practical advantage of the UMI method described in this study over the U-Haul total one-way arrival method and the Redfin home search method is that the U-Haul prices required for UMI calculations are readily available from the U-Haul website, while one-way arrival counts are available only to U-Haul, and home search data are available only to Redfin.

Conclusions

I used imbalances in U-Haul pricing data to generate U-Haul Moving Indices (UMI) for 100 US cities. This work builds on the ideas of Mark Perry in order to generate (near) real-time estimates of net people flows (Out- and Inflow) for each city.

While these data may reflect near real-time patterns of migration, they do not explain why people are moving to and from cities. Others have argued that taxes, cost of living, and similar factors spur people to leave cities and move to where conditions are better.

One of the assumptions I made (as did Perry) is that U-Haul sets prices dynamically based on the local supply of and demand for rental trucks. U-Haul may, however, use other factors to set truck rental prices.

Follow up

This is a work in progress and I may update this post over time. If you have feedback or questions, I’d love to hear from you.

Methods

Cities are top U.S. cities (in lower 48 states) by population, based on 2013 census estimates using data available here.

U-Haul prices were obtained for one-way truck rentals (10′ trucks) collected over ~2 days from the U-Haul web site (https://www.uhaul.com) in June 2018.

Acknowledgements

Thanks to Andy Idsinga for fruitful discussions and insights that motivated this work.

Notes

  1. The blog post and study that motivated this work: Mark Perry (Feb 2018) SF Bay Area experiences mass exodus of residents, leading to a shortage of U-Haul trucks and sky-high prices for scarce outbound trucks. AEIdeas Blog. Last accessed 2018-06-18 from URL: http://www.aei.org/publication/san-francisco-bay-area-experiences-mass-exodus-of-residents-reflected-in-one-way-u-haul-truck-rental-rates/
  2. The U-Haul report on migration that used total one-way arrivals data: U-Haul (May 2018) U-Haul Migration Trends: Houston Ranks as No. 1 U.S. Destination. U-Haul Blog. Last accessed 2018-06-18 from URL: https://www.uhaul.com/Articles/About/14384/U-Haul-Migration-Trends-Houston-Ranks-As-No-1-Us-Destination/
  3. The Redfin study using home searches: Greg McCarriston (Feb 2018) Affordable Inland Metros Drew People from San Francisco, New York and Los Angeles. Redfin Blog. Last accessed 2018-06-14 from URL: https://www.redfin.com/blog/2018/02/q4-migration-report.html
  4. Blog post discussing explanations for people moving among cities (and states): Mark Perry (Feb 2018) America’s top 10 inbound vs. top 10 outbound states: how do they compare on a variety of tax burden, business climate, fiscal health and economic measures? AEIdeas Blog. Last accessed 2018-06-18 from URL: https://www.aei.org/publication/americas-top-10-inbound-vs-top-10-outbound-states-how-do-they-compare-on-a-variety-of-tax-burden-business-climate-fiscal-health-and-economic-measures/

 

If you’re interested in analyses and graphs like the ones that appear in this article, then you might be interested in a new service I’ve started: Click here to visit Yukon Data Solutions.

It’s important that there is a good fit between me and my clients. Here are some of my thoughts on assessing that goodness of fit.

In an ideal relationship, the client:

  • Is a leader in biotechnology or life sciences industry. She may not work for a biotech company, but she is involved or affiliated with the biotech industry in some capacity.
  • Is facing a key challenge involving his data, data systems, or work culture around data. He knows his company is not getting the most from its data, and that it is time for some strategy and further investment.
  • Is a decision maker, and she is motivated to maximize the value of the solution during our engagement. This is not necessarily the same thing as minimizing the cost of the solution during our engagement.
  • Recognizes that his situation is likely to differ from other situations, and will require some diagnosis on my part before jumping into a solution.
  • Is prepared to describe the problem or challenge she is facing.
  • Is not simply looking for a pair of hands to execute a self-prescribed task list.

And I can offer my best work when:

  • The engagement starts at a high level, with strategy and advice, and transitions to the execution of customized data science deep dives, projects, or analyses.
  • The client’s problem is best tackled by a combination of my background in experimental biology, my experience in synthetic biology, and my expertise in developing assays, workflows and data pipelines & systems to support decision making in industry.
  • The situation can benefit from an outside perspective, and a diagnosis that is not affiliated with specific enterprise solutions or products.
  • I use data to test and support a client’s intuitions about their science, their processes, and their business.

If these ideas resonate with you, then we just might be a good fit and you should get in touch.

 

With the hype around deep learning neural nets and machine learning algorithms, it’s easy to forget that most of the work in data science involves accessing and preparing data for analysis. Indeed, not all data is Kaggle-ready. The reality is that data is often far from perfect.


Very far from perfect.

Do your consultant (and budget) a favor and follow these rules-of-thumb when using spreadsheets to collect and organize your data.

First, do not rely on spreadsheet formatting to indicate associations in your data. The most common form of this mistake is to use color-coding to indicate treatment or group.

[Screenshot: a spreadsheet in which alternating red and green rows indicate treatment]

Here, the alternating red and green rows (are supposed to) convey information about alternating treatments.  If the software (or data scientist) is color blind, however, then the association of “row color” with “treatment” is unclear, and likely to be lost. My recommendation is to explicitly capture and encode information about treatment in a separate column.

[Screenshot: the same data with treatment recorded explicitly in its own column]

A rule-of-thumb for good data preparation is that each row should not rely on its formatting or color to convey metadata or information that might be required or useful for analysis. You do not want to have to rely on your memory that red is ‘a’ and green is ‘b’.

Second, never merge spreadsheet cells. Another place I see room for improvement is with the use of “merged” cells in spreadsheets. Excel provides the user with the ability to combine cells, perhaps with an eye to improving the (human) readability of the data contained in the table.

[Screenshot: a table in which the Position and Setting cells have been merged across rows]

Merged cells are problematic when exported to .csv

Here, the Position and Setting columns for each row have been grouped using the “merge” function in Excel. The problem arises because most of the merged cells, when unmerged (during a .csv export, say), are left blank. These blank cells result in the loss of required or useful group associations for these records.
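If you are handed a .csv exported from a sheet like this, the blanks can usually be repaired by filling values downward. Here is a minimal R sketch, assuming the Position and Setting column names from the example and a hypothetical file name:

library(readr)
library(tidyr)

df <- read_csv("merged_data_export.csv")   # hypothetical export; blank cells are read in as NA

df_fixed <- fill(df, Position, Setting, .direction = "down")   # carry the last seen value down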

Remember: in preparing data for analysis, each row of data should be a stand-alone entity, and should not rely on its relative position or order to convey part of its meaning. A test of this principle: for any given table of data, you should be able to re-order the rows in any way without losing any information required or useful for analysis.

When setting up your spreadsheet for data analysis, the rule-of-thumb is simple: never merge your data cells.

[Screenshot: the same table with Position and Setting repeated on every row]

Much clearer, especially for my R scripts.

Third, always use Data Validation tools for data entry. Excel is an excellent tool for entering data. However, if you are capturing and entering data by hand, there are ample opportunities for making difficult-to-detect data entry errors. Can you spot the error in this table?

[Screenshot: a ten-row data-entry table containing one inconsistent Treatment label]

With only ten rows, you probably noticed that the Treatment entry in the third row (cell C6) has an UPPER CASE label, while the rest of the entries for Treatment have lower case labels.


Most scripting languages (R, python) are case sensitive — by default — and would group ‘a’ and ‘A’ separately, possibly leading to errors (unless detected during a preliminary data audit).
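As a quick illustration of that grouping behavior, here is a small R sketch with made-up data and one defensive fix on the analysis side:

library(dplyr)

d <- tibble(Treatment = c("a", "a", "A", "b"), x1 = c(10, 12, 11, 9))

count(d, Treatment)   # 'a' and 'A' are tallied as separate groups

d %>%
  mutate(Treatment = tolower(Treatment)) %>%   # normalize case before grouping
  count(Treatment)                             # now grouped as intended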

When you set up your spreadsheet for data entry, take time to consider checking data as it is entered. The above error could be prevented using the Data Validation function in Excel. Here, a list of acceptable values (a,b) could be used to confirm each entry. Entries not on the list are detected and can be corrected.

[Screenshot: Excel Data Validation configured with the list of acceptable values]

Conditional Formatting (typically, color highlighting) is another useful Excel tool for quickly checking that data is entered correctly. Expected values will have a format (e.g., color) and unexpected values will not, making them stand out during visual inspection.

The fourth rule of thumb is to never (ever!) delete rows of data that you want excluded from the analysis. Once a cell or row is deleted, you no longer have any indication that the deleted data ever existed! The best practice for excluding data from an analysis is to add a new column to the data table that acts as a flag (or indicator) variable, and to record, for each record, whether it should be included (e.g., 0 = “include”, 1 = “exclude”).

For additional clarity, a second column can be added that provides an explanation (perhaps encoded, or using Data Validation) for why the data is to be excluded from the analysis. In this way, if your consultant (or your future self) is curious and wants to analyze these “excluded” data, he or she has the ability to do so.

[Screenshot: a table with QC_flag and QC_description columns added]

The quality of the data is indicated by the QC_flag column (0 = good, 1 = bad), and the rationale for the QC status can be noted in the QC_description column.
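A minimal sketch of how the flag keeps excluded records available during analysis (the file name is hypothetical; the column names match the table described above):

library(dplyr)

raw <- read.csv("experiment_results.csv")    # hypothetical file containing a QC_flag column

analysis_data <- filter(raw, QC_flag == 0)   # rows flagged 1 are set aside, not deleted
excluded      <- filter(raw, QC_flag == 1)   # still available if anyone wants a second look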

(If you “Format as Table” your data, you can easily apply filters so that the data you want to exclude is made invisible.)

Fifth, to help your consultant (and perhaps your future self) understand and audit the types and ranges of your data, create a key that explains each column of data in a table. This key can live in a separate, human readable document — perhaps as a README.txt file that is updated and saved in the same folder as the spreadsheet file, or as information in a separate tab on the spreadsheet, or as information at the top of each spreadsheet. If a column of data contains a factor (e.g., a label, or class, or group, or treatment), enumerate all the possible values that this factor could take (In the above table, the QC flag can be 0=good, or 1=bad). If a column contains numbers, indicate the range (and perhaps information about the precision) for those numbers (In the above table, the x1 column is a measurement in mm units, ranging from 0-50). This information can be critical for communicating what the data means — and therefore how the data could be used, and how it should “typically” look. Departures from “typical” values (e.g., negative measurements) can be flagged for later consideration (or correction).
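As a sketch, a key for the table above might look like the following (kept, say, in a README.txt beside the spreadsheet; the ranges and codes are the ones mentioned above):

Column           Description
Treatment        factor; allowed values: a, b
x1               measurement in mm; expected range 0-50
QC_flag          0 = good, 1 = bad (exclude from analysis)
QC_description   free-text reason for the QC status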

The sixth rule for spreadsheets is to preserve the integrity of the data by separating the data from the analysis. This rule rests on a principle: the data are the data. As such, there should exist a single source of truth for the data. There isn’t my data, and there isn’t your data. There is only the data. In practice, there should not be different versions of the same(ish) data.

This does not mean that you cannot make versions of the analysis of the data, only that the raw data itself should not change.

I bring this up because, when it comes to data analysis, spreadsheets are a double-edged sword. Spreadsheets are useful because they allow you to capture data and analyze those data in the same setting (Excel!). This is good for two reasons: first, there is a relatively low barrier to entry for “doing data analysis,” and second, it enables rapid prototyping and exploration. Spreadsheets are dangerous, however, because they make it too easy to alter the data (perhaps by accident) without any record of what changed or how. This can lead to a proliferation of versions of the raw data, perhaps unintentionally.

Ideally, the data should be sequestered from the user in a way that makes it easy to access but very difficult to change. The user should be free to analyze the data in any way, as long as the original data isn’t altered. When organizing a data project (on your hard drive), I recommend making a subdirectory called “data” that contains only the data, saved as read-only files. A second subdirectory, called “analysis”, can contain files based on the raw data files but including additional analyses or annotations (or even changes). If I am going to do some analyses in Excel, I simply append “_analysis YYYY-MM-DD” to the name of the raw data file and save it in the analysis subdirectory.
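One way to sketch this layout from R (the raw file name is hypothetical, and Sys.chmod has no effect on some Windows setups, so treat read-only as a convention rather than a lock):

dir.create("data",     showWarnings = FALSE)   # raw data only, treated as read-only
dir.create("analysis", showWarnings = FALSE)   # derived files, annotations, reports

# after writing the raw file once, mark it read-only so casual edits fail
Sys.chmod("data/raw_data_2018-06-18.csv", mode = "0444")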

I would also be remiss if I didn’t take this opportunity to say that if you are planning on capturing a lot of (structured) data, it would be worth storing these data in a relational SQL database rather than in spreadsheet files. This practice helps to preserve the integrity of the data because it enforces the separation of data from analysis. (And I know a good consultant who can help you set up a database!)

The seventh rule for spreadsheets is to use a fixed spreadsheet template and to collect data in a series of spreadsheet files (rather than a series of tabs in a file). If you are collecting data from experiments that follow the same format, or make the same measurements or observations over different days, then consider setting up a template so that each collection of measurements is organized in the same way.

When analyzing your data, I’m going to be using a tool like R or a scripting language like python, and I will want to automate as much of that workflow as possible. To that end, a spreadsheet template means the data arrives in the form my R or python script expects, which makes the process of data inhalation easy, fast, and reliable.
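A sketch of the payoff in R: when every run follows the same template and is saved as its own .csv (here assumed to live in a data/ subdirectory), the whole series can be read and combined in a few lines.

library(readr)
library(purrr)
library(dplyr)   # map_dfr() binds the rows with dplyr

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
names(files) <- basename(files)

all_runs <- map_dfr(files, read_csv, .id = "source_file")   # one combined table, tagged by source file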

Ideally, you’ll set up the data collection spreadsheet before collecting the data. It is also a good idea to review the format of the template after a few preliminary rounds of data collection, to make sure you have optimized the template before collecting most of your data.

(Even better, if you know who is going to analyze your data, have that person review your spreadsheet template so that they can suggest changes that will facilitate their analyses.)

There you have my seven rules for using spreadsheets to capture and organize your data for analyses and machine learning. Let me know if I’ve missed any useful rules!

Further Resources:

  1. Murrell, Paul. Chapter 3: Data Intended for Human Consumption, not Machine Consumption in Bad Data Handbook, Edited by Q. Ethan McCallum, O’Reilly Media Inc., Sebastopol, 2013. This entire book is worth reading, although the topics in other chapters extend beyond the scope of this post.
  2. Pyle, Dorian. Data Preparation for Data Mining Academic Press, San Francisco, 1999. One of the first books I read on understanding how to best prepare data for machine learning.

Originally published July 22, 2016 (github -> https://github.com/jabus/givR)

How do you audit the objects in your R data analysis pipeline?

Given a dataframe object and a series of piped operations, (how) can you observe the intermediate objects that result from each operation in the pipeline?

First, prepare your analysis environment by loading some useful libraries:

library(dplyr) # load power packages
library(magrittr)
library(tidyr)

Here’s an example pipeline:

iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  select(-Petal.Length, -Sepal.Length) %>%
  group_by(Species) %>%
  summarize(n()) -> iris2

print(iris2)
## Source: local data frame [3 x 2]
## 
##      Species   n()
##        
## 1     setosa    20
## 2 versicolor    16
## 3  virginica    35

You can inspect the dataframes that enter and exit the pipeline (using the print command, for example), but how can you observe the intermediate objects generated along the pipeline? This is the challenge I explore in this blog post.

Line by line

You can highlight and source (or copy and paste) each section of the pipeline and observe the resulting object. This is okay for development, but is ephemeral and not very useful in research operations.

Assignment statements

You can use the assign() function to copy the intermediate objects, and inspect those copies later, e.g.:

iris %>%
  mutate(Sepal.Area = Sepal.Width * Sepal.Length) %>%
  assign(x="x1",value=., pos=1) %>%
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%
  assign(x="x2",value=., pos=1) %>%
  filter(Sepal.Area > mean(Sepal.Area)) %>%
  assign(x="x3",value=., pos=1) %>%
  select(-Petal.Length, -Sepal.Length) %>%
  assign(x="x4",value=., pos=1) %>%
  group_by(Species) %>%
  assign(x="x5",value=., pos=1) %>%
  summarize(n()) -> iris2


print(head(x1))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06

This works, although the alternating assignment lines along the pipeline feel a bit repetitive.

Building an audit trail

Here, I build on the above idea, by copying and collecting intermediate dataframes (objects) along the pipeline, into an audit trail.

There are two steps to building an audit trail. First, you initiate the audit trail by instantiating an audit object with the start_audit() function.

Second, you wrap any (all) piped operations in your pipeline with the a_() function. This takes a snapshot of the dataframe entering that piped operation and saves it to your audit trail.

After the pipe has completed, you can inspect the intermediate dataframes captured in your audit trail using the audit_trail() function.

Here are the function definitions for building and using an audit trail. (I may package these into an R package at some point in time.)

#function definitions

start_audit <- function() { #initialize (reset) the node counter to zero
  audit <- new.env()
  audit$node <- 0
  return(audit)
}

a_ <- function(x, f, ...) { #main functionality is captured with a_
  audit$node <<- audit$node + 1            #advance the node counter in the audit environment
  audit[[as.character(audit$node)]] <- x   #snapshot the incoming dataframe under the node label
  x %>% f(...)                             #then apply the wrapped operation as usual
}

audit_trail <- function(a, ...) { #print the head of each audited object
  if (missing(...)) {
    node_labels <- 1:length(ls(a, pattern = "[^node]")) #all numbered nodes; excludes the "node" counter itself
  } else {
    node_labels <- list(...)                            #or only the nodes requested by the caller
  }
  for (i in 1:length(node_labels)) {
    print(paste0("audit trail: ", node_labels[i]))
    print(head(a[[as.character(node_labels[i])]]))
  }
}

How to use the audit trail

Below, I illustrate how to build and inspect an audit trail for the example pipeline.

### Employ an auditing object with the pipeline

start_audit() -> audit #Initialize the audit object (set node to zero)

iris %>%
  a_(mutate, Sepal.Area = Sepal.Width * Sepal.Length) %>%
  a_(mutate, Petal.Area = Petal.Width * Petal.Length) %>%
  a_(filter, Sepal.Area > mean(Sepal.Area)) %>%
  a_(select, -Petal.Length, -Sepal.Length) %>%
  a_(group_by, Species) %>%
  a_(summarize, n() ) -> iris2

Note that a_() wraps around each unit operation in the pipeline; its first explicit argument is that operation’s function, and the remaining arguments are passed through to it.

At this point, we can inspect the objects audited in the pipeline:

audit %>% audit_trail #reveal all objects
## [1] "audit trail: 1"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## [1] "audit trail: 2"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
## [1] "audit trail: 3"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
##   Petal.Area
## 1       0.28
## 2       0.28
## 3       0.26
## 4       0.30
## 5       0.28
## 6       0.68
## [1] "audit trail: 4"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          5.0         3.6          1.4         0.2  setosa      18.00
## 3          5.4         3.9          1.7         0.4  setosa      21.06
## 4          5.4         3.7          1.5         0.2  setosa      19.98
## 5          5.8         4.0          1.2         0.2  setosa      23.20
## 6          5.7         4.4          1.5         0.4  setosa      25.08
##   Petal.Area
## 1       0.28
## 2       0.28
## 3       0.68
## 4       0.30
## 5       0.24
## 6       0.60
## [1] "audit trail: 5"
##   Sepal.Width Petal.Width Species Sepal.Area Petal.Area
## 1         3.5         0.2  setosa      17.85       0.28
## 2         3.6         0.2  setosa      18.00       0.28
## 3         3.9         0.4  setosa      21.06       0.68
## 4         3.7         0.2  setosa      19.98       0.30
## 5         4.0         0.2  setosa      23.20       0.24
## 6         4.4         0.4  setosa      25.08       0.60
## [1] "audit trail: 6"
## Source: local data frame [6 x 5]
## Groups: Species [1]
## 
##   Sepal.Width Petal.Width Species Sepal.Area Petal.Area
##                              
## 1         3.5         0.2  setosa      17.85       0.28
## 2         3.6         0.2  setosa      18.00       0.28
## 3         3.9         0.4  setosa      21.06       0.68
## 4         3.7         0.2  setosa      19.98       0.30
## 5         4.0         0.2  setosa      23.20       0.24
## 6         4.4         0.4  setosa      25.08       0.60
audit %>% audit_trail(2, 3) #reveal the objects at the 2nd and 3rd nodes
## [1] "audit trail: 2"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
## [1] "audit trail: 3"
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
## 3          4.7         3.2          1.3         0.2  setosa      15.04
## 4          4.6         3.1          1.5         0.2  setosa      14.26
## 5          5.0         3.6          1.4         0.2  setosa      18.00
## 6          5.4         3.9          1.7         0.4  setosa      21.06
##   Petal.Area
## 1       0.28
## 2       0.28
## 3       0.26
## 4       0.30
## 5       0.28
## 6       0.68

N.B. I’ve written the audit_trail function to display only the head of each object.

Next steps

  • Solicit feedback from the R community on this idea.

Revised and republished 2017-09-29.

jabustyerman@gmail.com