There are many ways that Data Science could add value. In theory.
In practice, I’ve found that there is one way for Data Science to add value, and that way is in supporting a decision-making process.
Note that supporting a decision-making process is not the same as “making the decision”; ultimately, the decision must rest with a human who is accountable for it. If you use software to “make decisions,” you are still responsible and accountable, because you decided to employ the software as your proxy.
The real opportunity arises when a decision involves uncertainty (or ambiguity) and some “data science” (read: a system and its associated analyses) can be applied to 1) quantify the uncertainty, and 2) reduce that uncertainty for the decision. Using information to reduce uncertainty adds value, and that value can be quantified.
What else does Data Science do?
- Is it a workflow, a pipeline, or a database architecture? Maybe in part, but only if it facilitates decision making.
- Is it knowledge management? Maybe in part, but only if it enables decisions based on that knowledge.
- Does it improve understanding? Perhaps, but how do you translate that improved understanding into value? Only through your subsequent actions, i.e., decisions.
In the end, whether you invest in Data Science should really depend on whether you have decisions where you want to reduce the uncertainty associated with different actions or choices.
When taking on a new role, you’ll inevitably have new tasks and responsibilities. Some of those tasks will be repeated, and at some point you might wonder whether automating the task would be more efficient than continuing to do the task manually.
For me, the threshold for automation is typically reached after I’ve done a task just once. And knowing I may have to repeat a task actually influences how I undertake the task the first time. That is, while doing the task, I’m asking myself how I might change what I need to do right now, based on the likelihood that I’ll have to do the task 10 more times, or 1,000 more times.
I’ve worked in biotechnology companies where we would typically run an experiment, and after analyzing the results, decide to re-run the experiment — perhaps with a minor tweak to the protocol, or using a different set of subjects or experimental conditions. In some cases, the insights derived from the experiments pushed us to make the experiment operational, that is, a regular and repeated part of our workflow. This meant that we would run the experiment on a schedule, once a week or once a month, to generate new data and glean new insights.
Whether the actual experiment could or should be automated would depend on the nature of the experiment.
However, analyzing data from the experiment was almost always a candidate for automation, and if I knew I had to refresh the data and do the analysis again, I would always write R scripts to automate the data analysis and report generation. I wanted to automate the parts of the workflow that took time, required attention to detail and precision, were well-defined, and were often mundane, boring, and error-prone.
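To give a sense of the shape of such a script: here is a minimal sketch of a refresh-and-report routine, shown in Python rather than R, with a hypothetical CSV layout and file names chosen purely for illustration.

```python
import csv
from datetime import date
from statistics import mean

def refresh_report(data_path, report_path):
    """Re-run the analysis on a refreshed data file and regenerate the report.

    Assumes a CSV with a numeric 'measurement' column; the column name
    and both file paths are hypothetical placeholders.
    """
    with open(data_path, newline="") as f:
        rows = list(csv.DictReader(f))
    values = [float(r["measurement"]) for r in rows]

    # Summarize the refreshed data; a real script would do the full analysis.
    summary = {
        "date": date.today().isoformat(),
        "n": len(values),
        "mean": round(mean(values), 3),
        "min": min(values),
        "max": max(values),
    }

    # Write a plain-text report; swap in your actual report generator here.
    with open(report_path, "w") as f:
        for key, val in summary.items():
            f.write(f"{key}: {val}\n")
    return summary
```

Once a routine like this exists, re-running the analysis after each new experiment is a single function call instead of an afternoon of error-prone copy-and-paste.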
Once achieved, the automation frees up time to think about the message in the data, and to translate the insights from the data into value for the organization.
Once you’ve decided that a task, like data analysis, should be automated, you face the question of how to go about automating the process. The answer may depend on whether you and your colleagues have the skills to do the automation, the capital to hire others to do it, or the means to purchase a drop-in solution that enables automation (e.g., software).
As a grad student, the balance of time and money favored learning the tools (in this case statistics, R, Python, and SQL) and doing the automation myself.
However, in a business environment, learning these tools may not be the best option. Learning these tools takes a lot of time and effort, and that equates to an opportunity cost, i.e., your time and energy may be better spent applying your expertise and thinking about your problem domain. Remember that these tools ultimately result in a system that automates the task in question. This system can involve software (with source code), databases (with supporting architecture), web services (with access and security), and the time to address new issues (changes in requirements) as they arise in your process. With these considerations in mind, it may make sense to hire dedicated staff (if the process and system is large enough to warrant this choice), or purchase or subscribe to a software system that meets your automation needs.
Having said that, here’s a shameless plug for a new service we’re offering:
Yukon Data Solutions is our data analytics and report generation service. We work with biotechnology and life science teams to automate their data analytics pipelines. We focus on the rapid development and deployment of lightweight, customized automation solutions. Our goal is to take care of the painful parts of data analysis and report generation, so you can focus on the more important task of thinking about what it all means. Drop us an email if you want to find out whether Yukon Data Solutions might be a good fit for your needs. One last thing: our service is 100% satisfaction guaranteed.
A post by Mark Perry about U-Haul truck rental prices suggested that so many people were moving away from San Jose, CA that it was causing shortages of U-Haul rental trucks in San Jose. In response to high demand for rental trucks, U-Haul was adjusting one-way truck rental prices for leaving San Jose to be many multiples of the prices to move from the same cities to San Jose. For example, Perry reported that the price to rent a truck to move from San Jose to Las Vegas, NV was 16x the price to move from Las Vegas to San Jose.
Perry suggested that U-Haul would use dynamic pricing to optimize one-way truck rental prices in response to local supply and demand, and that we could use the ratio of Outbound moving prices to Inbound moving prices to infer net movement of people between pairs of cities. I wondered whether we could use this idea to get a real-time measurement of movement patterns (“migration”) among other US cities.
Because Perry only reported on price imbalances between San Jose and six other cities, I wanted to start by looking more broadly at pricing imbalances across the US. I collected U-Haul pricing data for the 100 largest US cities (by population), with the intention of ranking cities for Outflow (and Inflow), based on average imbalance in truck rental prices.
I paired each focal city (“A”) with every other city (“B”), and used the U-Haul website to collect one-way pricing quotes to rent a 10′ truck to move from A to B (Outbound), and from B to A (Inbound). I calculated log(Outbound/Inbound) for each city pair, and then took the median of these values across all city pairs (for each focal city) to generate an index of migration for that city. I called this index the U-Haul Moving Index, or UMI.
For example, using San Jose, CA as the focal city (A): the Outbound and Inbound one-way truck rental prices between San Jose and Tucson, AZ (B) were $1271 and $161 respectively. The ratio Outbound/Inbound is $1271/$161 = 7.9, and taking the natural log of 7.9 yields 2.06. Repeating this calculation with San Jose as A and every other city as B, the median value (the UMI) for San Jose was 0.69.
[I converted the ratio of Outbound/Inbound to a log scale because taking the natural log of a ratio turns an asymmetric linear scale into a symmetric one. In this way, an Outbound price that is 2x the Inbound price has the same magnitude (with opposite sign) as an Outbound price that is 0.5x the Inbound price, i.e., a factor of 2 in both cases.]
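To make the calculation concrete, here is a minimal Python sketch of the UMI computation. The price pairs below are made-up illustrative numbers, not actual U-Haul quotes (apart from the San Jose–Tucson pair discussed above).

```python
import math
from statistics import median

def umi(price_pairs):
    """U-Haul Moving Index for one focal city.

    price_pairs: list of (outbound, inbound) one-way rental quotes,
    one pair per (focal city A, other city B) combination.
    Returns the median of ln(outbound / inbound) over all pairs:
    positive when outbound quotes exceed inbound quotes overall.
    """
    log_ratios = [math.log(out / inb) for out, inb in price_pairs]
    return median(log_ratios)

# Note the symmetry of the log scale: ln(2) and ln(0.5) have the
# same magnitude (about 0.69) with opposite signs.
quotes = [(1271, 161), (800, 400), (300, 300)]   # hypothetical city pairs
print(round(umi(quotes), 2))                     # median of ~[2.07, 0.69, 0.0] -> 0.69
```

A positive UMI for a focal city means Outbound quotes tend to exceed Inbound quotes, which (under the dynamic-pricing assumption) suggests a net outflow of people.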
The main results are illustrated in Figures 1 and 2 below.
Figure 1 ranks cities according to the U-Haul Moving Index (UMI). Cities with positive UMI (purple) are cities where Outbound prices are greater than Inbound prices, suggesting that trucks are in short supply due to a net outflow of people. Cities with negative UMI (orange) are cities where Outbound prices are less than Inbound prices, suggesting a net inflow of people.
Figure 2 is a map of 100 cities, colored by U-Haul Moving Index (UMI), and highlights strong regional patterns in these data.
Regions having positive UMIs (dominated by people leaving) include California, Chicago (and the surrounding Lake Michigan states), New York City (and the northeastern seaboard states), and Miami, FL.
Regions having negative UMIs (dominated by people arriving) include the southeastern states, Texas & Oklahoma, Arizona, and Boise, ID.
How do these data compare to other studies?
Previously, U-Haul has analyzed its own data to report on migration trends. U-Haul used the total number of one-way arrivals in 2017 to rank cities as US destinations, and found that Houston, TX and Chicago, IL were the top two destination cities. In contrast, the UMI (this analysis) ranked Houston #42 of 100 among inflow cities, and Chicago as the top outflow city after the California cities (Chicago was ranked 15/100). Additionally, U-Haul’s total one-way arrival method ranked San Jose, CA #42 in its list of top 50 destination cities, while this study using UMI ranks San Jose, CA as tied for top outflow city (#1 of 100). These differences in ranking may reflect differences in methodology. However, because the data in these studies came from different time periods (2017 for the one-way arrival data, and June 2018 for the UMI data), it is conceivable that the differences in city rankings are due to underlying differences in migration patterns rather than methodology. Whether one method describes patterns of migration more accurately has not been determined. In my opinion, however, the total one-way arrivals method used by U-Haul ignores the number of one-way departures, and may therefore not provide the full accounting of inflows and outflows required to calculate the net flow of people to and from a city. Rental pricing, if dynamically optimized in response to local supply and demand, could better integrate information about in- and outflows of people.
The company Redfin has used house search data to estimate the movement of people, and ranked San Francisco, New York and LA as top outflow cities (with Chicago as #5), a result more in line with the UMI rankings in this study. However, Redfin also identified Sacramento, Phoenix and Las Vegas as top destination cities, while the UMI in this study strongly ranked Sacramento as an outflow city and Las Vegas as having a more balanced in- and outflow of people. (Phoenix had a moderately negative UMI, suggesting it was a moderate inflow city.) As with the U-Haul total one-way arrival method, it is unknown whether the differences in city rankings between Redfin and this study stem from differences in methodology or from differences in moving patterns, given that the data were captured during different time periods.
One practical advantage of the UMI method described in this study over the U-Haul total one-way arrival method and the Redfin home search method is that the U-Haul prices required for UMI calculations are readily available from the U-Haul website, while one-way arrival counts are available only to U-Haul, and home search data are available only to Redfin.
I used imbalances in U-Haul pricing data to generate U-Haul Moving Indices (UMI) for 100 US cities. This work builds on the ideas of Mark Perry to generate (near) real-time estimates of net flows of people (outflow and inflow) for each city.
While these data may reflect near real-time patterns of migration, they do not explain why people are moving to and from cities. Others have argued that taxes, cost of living, etc. spur people to leave cities and move to where conditions are better.
One of the assumptions I made (as did Perry) is that price is dynamically determined by U-Haul based on local supply and demand of rental trucks. U-Haul may use other factors to set truck rental prices.
This is a work in progress and I may update this post over time. If you have feedback or questions, I’d love to hear from you.