When taking on a new role, you’ll inevitably have new tasks and responsibilities. Some of those tasks will be repeated, and at some point you might wonder whether automating the task would be more efficient than continuing to do the task manually.
For me, the threshold for automation is typically reached after I’ve done a task just once. And knowing I may have to repeat a task actually influences how I undertake the task the first time. That is, while doing the task, I’m asking myself how I might change what I need to do right now, based on the likelihood that I would have to do the task 10 more times, or 1,000 more times.
I’ve worked in biotechnology companies where we would typically run an experiment, and after analyzing the results, decide to re-run the experiment — perhaps with a minor tweak to the protocol, or using a different set of subjects or experimental conditions. In some cases, the insights derived from the experiments pushed us to make the experiment operational, that is, a regular and repeated part of our workflow. This meant that we would periodically run the experiment once a week, or once a month, to generate new data and glean new insights.
Whether the actual experiment could or should be automated would depend on the nature of the experiment.
However, analyzing data from the experiment was almost always a candidate for automation, and if I knew I had to refresh the data and do the analysis again, I would always write R scripts to automate the data analysis and report generation. I wanted to automate the parts of the workflow that took time, required attention to detail and precision, were well-defined, and were often mundane, boring, and error-prone.
Once achieved, the automation gives one the time to think about the message in the data, and the time to translate the insights from the data into value for the organization.
Once you’ve decided that a task, like data analysis, should be automated, you are faced with the question of how to go about automating the process. The answer may depend on whether you and your colleagues have the skills to do the automation, or the capital to hire others to do the automation, or to purchase a drop-in solution that enables automation (e.g., software).
As a grad student, the balance of time and money favored my learning the tools (in this case statistics, R, Python, and SQL) to do the automation myself.
However, in a business environment, learning these tools may not be the best option. Learning these tools takes a lot of time and effort, and that equates to an opportunity cost, i.e., your time and energy may be better spent applying your expertise and thinking about your problem domain. Remember that these tools ultimately result in a system that automates the task in question. This system can involve software (with source code), databases (with supporting architecture), web services (with access and security), and the time to address new issues (changes in requirements) as they arise in your process. With these considerations in mind, it may make sense to hire dedicated staff (if the process and system are large enough to warrant this choice), or purchase or subscribe to a software system that meets your automation needs.
Having said that, here’s a shameless plug for a new service we’re offering:
Yukon Data Solutions is our data analytics and report generation service. We work with biotechnology and life science teams to automate their data analytics pipelines. We focus on the rapid development and deployment of light-weight, customized automation solutions. Our goal is to take care of the painful parts of data analysis and report generation, so you can focus on the more important task of thinking about what it all means. Drop us an email if you want to determine if Yukon Data Solutions might be a good fit for your needs. One last thing: Our service is 100% satisfaction guaranteed.
I want to contrast two modes of work in R&D environments. The first is Engineering mode, and the second is Discovery mode. These modes of work often share similar rules of logic, rely on similar concepts and frameworks, and leverage the same scientific tools and technologies. Engineering and Discovery modes can rely on using the same language (and jargon). Engineering and Discovery modes often occur in similar environments, and even in the same person (albeit usually at different times). In education, these modes are lumped together: The Science & Engineering of STEM roughly map to the Discovery and Engineering modes I’m describing here.
A critical and defining difference between Engineering and Discovery modes, however, is revealed in the distribution of outcomes (or “typical results” and “surprises”) associated with these work modes. Others have argued that the defining difference lies in the flow of information (Drexler, see Review here), but I will focus on outcomes in this post.
For work conducted in Engineering mode, most outcomes (i.e., typical results) look like success. Software engineers produce software that works 99% of the time. Bridge and aeronautical engineers produce bridges and airplanes that work 99.999999% of the time. My numbers might be off, but the idea is that typical results are successful results, and moreover, typical results are often close to perfect.
For Discovery mode, the expectation of outcomes is the opposite. Most outcomes look like failure. Visualize gold mines and oil field explorations and drug discovery — where success happens 1%, 0.01% or 0.00001% (or less!) of the time. My point is that typical outcomes are failures and misses.
The asymmetry in typical outcomes between Engineering and Discovery mode activities is illustrated below.
For Engineering mode (bottom panel), most outcomes are positive, but not that positive. When surprising events occur for Engineering mode, they are almost always surprising in a bad way and can be disastrous. Trains, planes and automobiles mostly run on time, except when they don’t, and then they are almost always late.
For Discovery mode (top panel), most outcomes are negative, but not that negative. When your latest batch of compounds fails to have an effect on a cancer cell line, you might be disappointed, but you probably aren’t surprised. But for Discovery mode, when surprising events do occur, they are almost always surprising in a good way (jackpot!). The stream of misses and failures is explained as the cost of doing business.
In larger companies, employees may often take on roles that are predominantly one mode or the other. Scientists/Engineers may often be trying to discover new insights or develop new technology, and thus spend most of their time in Discovery mode. Engineers, operators, and service staff, in contrast, may be working with well-developed systems and operate for the most part in Engineering mode.
In smaller companies (or on individual projects) it isn’t atypical for an individual to transition between Engineering and Discovery modes in the course of working on a project.
To illustrate this point: I recently worked on a data science project that required both modes of operation. The first step was to formulate a question and hypothesis: This was the work of a (Data) Scientist and fit neatly into the Discovery mode. The next step was to design and build a system for capturing and storing data to explore and test the hypothesis. This involved programming a web scraper and database for storing data — and because this step relied on my deployment of well understood programming technologies (where I had a high certainty that I could accomplish the task, >90%), it fit neatly into the Engineering mode. This step culminated in the successful construction and operation of a system to capture and store the data needed to answer (or at least explore) the question of interest. Next, and with the data in hand, I shifted back to Discovery mode and began to explore the data. I was looking for patterns and insights. Here, although I had some ideas about what I might (and hoped to!) find, I had no certainty in any particular outcome. The results of this step (thus far) were my findings (insights and ideas about the way the world works, and more importantly new questions for next steps of inquiry).
For this project, the Engineering and Discovery modes of work were both necessary. Was I successful? It depends: I am 100% certain that I built a system for capturing and storing the data to answer the question (The Engineering mode activities). The jury is still out about whether the findings from my Discovery mode activities were interesting (or important).
Where do you want to operate?
With respect to your job or occupation or career, and if you have a choice, where do you want to spend your time (and on what sorts of activities?) — on Discovery or Engineering?
If you are an employee at an R&D company, my answer is “it depends”. If you are an employee, you likely have a manager. And if your manager is a good manager, then she will recognize that there are differences between Discovery and Engineering modes and activities and most importantly outcomes, and set expectations, incentives and compensation to reflect those different modes accordingly.
If, however, your manager is blind to the asymmetry in distributions of outcomes I’ve described above, then the last thing you want is to be misclassified by him and subjected to the expectations (KPIs) and reward structure (incentives and compensation) of the alternate mode. This can be especially bad for Discovery mode activities (and failures) when subjected to the expectations that a manager might have for Engineering mode activities (and successes). (And this is why it is important to distinguish between Data Engineering and Data Science.)
The economic environment under which a company operates can also impact where you want to operate on the Discovery-Engineering spectrum. In boom times, when there is money in the bank, when there is excitement from early wins, it’s great to operate in Discovery mode. If your role, however, is Engineering focused, then you might grow resentful when others are lauded for their amazing discoveries. My recommendation is to find Discovery mode opportunities in your otherwise Engineering (or Operational) filled world of activities.
If the economic environment erodes, and you typically operate in Discovery mode, however, look out! When budgets get smaller, deadlines tighten and the easy and early wins give way to missed goals (or losses), then humans become fearful and conservative. Here, managers may increasingly incentivize employees to act conservatively to reduce the riskiness of their activities (and implicitly reduce the volatility of their results). For the Discovery mode employee, this reduction in volatility is disastrous, because it caps their upside and prevents them from realizing any truly extreme (and beneficial) payoffs. If you find yourself in this position, my recommendation is to look for Engineering mode activities (temporarily) or seek out new employment opportunities.
Engineering mode shuns uncertainty, because uncertainty may involve risk that corresponds to bad surprises. Discovery mode thrives under uncertainty, especially when a rare but beneficial result leads to finding something new, or a reduction of uncertainty in the face of making strategic decisions.
In summary, to understand the distinctions of Discovery and Engineering modes, one needs to have an appreciation for variation and the underlying distribution of outcomes expected while operating in each mode respectively. Without understanding the asymmetry in their outcome distributions, it would be difficult to convey how these work modes are different.
A post by Mark Perry about U-Haul truck rental prices suggested that so many people were moving away from San Jose, CA that it was causing shortages of U-Haul rental trucks in San Jose. In response to high demand for rental trucks, U-Haul was adjusting one-way truck rental prices for leaving San Jose to be many multiples of the prices to move from the same cities to San Jose. For example, Perry reported that the price to rent a truck to move from San Jose to Las Vegas, NV was 16x more than the price to move from Las Vegas to San Jose.
Perry suggested that U-Haul would use dynamic pricing to optimize one-way truck rental prices in response to local supply and demand, and that we could use the ratio of Outbound moving prices compared to Inbound moving prices to infer net movement of people between pairs of cities. I wondered whether we could use this idea to get a real-time measurement of movement patterns (“migration”) among other US cities.
Because Perry only reported on price imbalances between San Jose and six other cities, I wanted to start by looking more broadly at pricing imbalances across the US. I collected U-Haul pricing data for the 100 largest US cities (by population), with the intention of ranking cities for Outflow (and Inflow), based on average imbalance in truck rental prices.
I paired each focal city (“A”) with every other city (“B”), and used the U-Haul website to collect one-way pricing quotes to rent a 10′ truck to move from A to B (Outbound), and from B to A (Inbound). I calculated the log(Outbound/Inbound) for each city pair, and then used the average result of all city pairs (for each focal city) to generate an index of migration for that city. I called this index the U-Haul Moving Index, or UMI.
For example, using San Jose, CA as the focal city (A): The Outbound and Inbound one-way truck rental prices between San Jose and Tucson, AZ (B) were $1271 and $161 respectively. The Outbound/Inbound ratio is $1271/$161 = 7.9, and taking the natural log of this ratio yields approximately 2.07. This value was calculated using San Jose as A, and all other cities as B, and the median value (“UMI”) for San Jose was 0.69.
[I converted the ratio of Outbound/Inbound to a log scale because taking the natural log of a ratio transforms the asymmetric linear scale into a symmetrical log scale. In this way, an Outbound price that is 2x the Inbound price will have the same scaling as an Outbound price that is 0.5x the Inbound price, i.e. a factor of 2 in both cases.]
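The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names `log_price_ratio` and `moving_index` are hypothetical, and the summary statistic is left as a parameter because the text mentions both an average and a median. Only the San Jose and Tucson prices come from the example above.

```python
import math
from statistics import mean, median


def log_price_ratio(outbound, inbound):
    """Natural log of the Outbound/Inbound price ratio for one city pair."""
    return math.log(outbound / inbound)


def moving_index(quotes, summary=median):
    """Summarize the log ratios for one focal city.

    `quotes` maps each paired city (B) to a tuple of
    (outbound_price, inbound_price) for the focal city (A).
    `summary` is the statistic applied across city pairs (an assumption:
    pass `mean` or `median` as appropriate).
    """
    ratios = [log_price_ratio(out, inb) for out, inb in quotes.values()]
    return summary(ratios)


# San Jose (A) to Tucson (B), using the prices quoted in the text.
tucson = log_price_ratio(1271, 161)
print(f"log(Outbound/Inbound) for Tucson: {tucson:.2f}")

# The log scale is symmetric: 2x and 0.5x differ only in sign.
print(log_price_ratio(200, 100), log_price_ratio(100, 200))
```

A positive index then corresponds to Outbound prices exceeding Inbound prices on balance, and a negative index to the reverse. Calling `moving_index` with a full dictionary of quotes for a focal city would return that city's index under the chosen summary statistic.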
The main results are illustrated in Figures 1 and 2 below.
Figure 1 ranks cities according to the U-Haul Moving Index (UMI). Cities having positive UMI (purple) are cities where Outbound prices are greater than Inbound prices, suggesting that trucks are in short supply, due to a net outflow of people. Cities having negative UMI (orange) are cities where Outbound prices are less than Inbound prices, suggesting that trucks are in ample supply, due to a net inflow of people.
Figure 2 is a map of 100 cities, colored by U-Haul Moving Index (UMI), and highlights strong regional patterns in these data.
Regions having positive UMIs — dominated by people leaving the regions — include California, Chicago (and surrounding Lake Michigan states), New York City (and north eastern seaboard states), and Miami, FL.
Regions having negative UMIs — dominated by people arriving in the regions — include the south eastern states, Texas & Oklahoma, Arizona, and Boise, ID.
How do these data compare to other studies?
Previously, U-Haul has analyzed its own data to report on migration trends. U-Haul used the total number of one-way arrivals in 2017 to rank cities as US destinations, and found that Houston, TX and Chicago, IL were the top two destination cities. In contrast, UMI (this analysis) ranked Houston #42 (of 100) among inflow cities, and Chicago as the top outflow city after California cities (Chicago was ranked 15/100). Additionally, U-Haul’s total one-way arrival method ranked San Jose, CA as #42 in its list of top 50 destination cities, and this study using UMI ranks San Jose, CA as tied for top outflow city (#1 of 100). These differences in ranking may reflect differences in methodology. However, because data in these studies came from different time periods (2017 for one-way arrival data, and June 2018 for UMI data) it is conceivable that differences in city rankings are due to underlying differences in migration patterns rather than methodology. Whether one method is more accurate at describing patterns of migration has not been determined. In my opinion, however, the total one-way arrivals method used by U-Haul appears to ignore the numbers of one-way departures, and may therefore not present a full accounting of inflows and outflows required to calculate the net flow of people to/from the city. Rental pricing, if dynamically optimized in response to local supply and demand, could better integrate information about in- and outflows of people.
The company Redfin has used house search data to estimate the movement of people and ranked San Francisco, New York and LA as top outflow cities (with Chicago as #5) — a result more in line with the UMI rankings in this study. However, Redfin also identified Sacramento, Phoenix and Las Vegas as top Destination cities, while the UMI in this study strongly ranked Sacramento as an outflow city and Las Vegas as having more balanced in- and outflow of people. (Phoenix had a moderately negative UMI suggesting it was a moderate inflow city). As with the U-Haul total one-way arrival method, it is unknown whether differences in city rankings between Redfin and this study stem from differences in methodology or differences in moving patterns due to the data being captured during different time periods.
One practical advantage of the UMI method described in this study over the U-Haul total one-way arrival method and the Redfin home search method is that the U-Haul prices required for UMI calculations are readily available from the U-Haul website, while one-way arrival data are available only to U-Haul, and home search data are available only to Redfin.
I used imbalances in U-Haul pricing data to generate U-Haul Moving Indices (UMI) for 100 US cities. This work builds on the ideas of Mark Perry in order to generate (near) real-time estimates of net people flows (Out- and Inflow) for each city.
While these data may reflect near real-time patterns of migration, they do not explain why people are moving to and from cities. Others have argued that taxes, cost of living, etc. spur people to leave cities and move to where conditions are better.
One of the assumptions I made (as did Perry) is that price is dynamically determined by U-Haul based on local supply and demand of rental trucks. U-Haul may use other factors to set truck rental prices.
This is a work in progress and I may update this post over time. If you have feedback or questions, I’d love to hear from you.
It’s important that there is a good fit between me and my clients. Here are some of my thoughts on assessing goodness of fit.
In an ideal relationship, the client:
Is a leader in the biotechnology or life sciences industry. She may not work for a biotech company, but she is involved or affiliated with the biotech industry in some capacity.
Is facing a key challenge involving his data, data systems, or work culture around data. He knows his company is not getting the most from its data, and that it is time for some strategy and further investment.
Is a decision maker and she is motivated to maximize the value of the solution during our engagement. This is not necessarily the same thing as minimizing the cost of the solution during our engagement.
Recognizes that his situation is likely to differ from other situations, and will require some diagnosis on my part before jumping into a solution.
Is prepared to describe the problem or challenge she is facing.
Is not simply looking for a pair of hands to execute a self-prescribed task list.
And I can offer my best work when:
The engagement starts at a high level with strategy and advice, and transitions to the execution of customized data science deep dives, projects or analyses.
The client’s problem is best tackled by a combination of my background in experimental biology, my experience in synthetic biology, and my expertise in developing assays, workflows and data pipelines & systems to support decision making in industry.
The situation can benefit from an outside perspective, and a diagnosis that is not affiliated with specific enterprise solutions or products.
I use data to test and support a client’s intuitions about their science, process, and business.
If these ideas resonate with you, then we just might be a good fit and you should get in touch.