Four Common Pitfalls in Data Engineering

By Will Goodrum*

Note: A version of this article was first published on the Elder Research blog.

Your company has made it a strategic priority to become more data-driven. Good! A major anticipated component of this transition is to implement new data technology (e.g., a data lake). Resources are thrown at identifying source systems and pulling information into a new, analytically-focused data repository or an even bigger data lake. Time is spent creating an ETL pipeline to move data from one place to another. Web endpoints are created to facilitate access for data customers. Dashboards are created that show information available in this centralized and optimized data source. At a briefing with the company executive team 12 months later, the excited response from the C-level is a resounding: “So how has any of this effort made us more data-driven?”

What is missing? How can they not see the value of what has been done? Achieving success with Data Engineering is like a long hike. It’s not just about getting gear and walking around in the forest. You need to know where you’re going, who can help, why you’re going, and what you’ll need. Similarly, data engineering requires clear use cases, a central governance function, a straight-line tie to ROI, and an identified audience. Miss any one of these in your initiative, and you risk falling into one of four common data engineering pitfalls.

What is Data Engineering? (and its common pitfalls)

First let’s define “data engineering.” Robert Chang wrote an excellent piece on Medium called “A Beginner’s Guide to Data Engineering” that illustrates many key ideas. He quotes a great article by Maxime Beauchemin called “The Rise of the Data Engineer”:

“Data engineering (as a) field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering.”

Data Engineers are experts who work with data storage and transfer technologies to ensure that data are reliable, available, and accessible for anything from BI reporting to Deep Learning. As Chang states, data engineering is an important and adjacent discipline to data science. Experienced data scientists rely on and appreciate the complementary functions provided by data engineers.

As we have navigated many projects with clients over our 25-year history, we’ve encountered four common pitfalls that can undermine data engineering efforts and consume downstream value from data science:

Creating infrastructure without use cases
Centralizing data without governance
Beginning construction without estimating ROI potential
Designing an architecture without an audience

Let’s unpack each of those to get a better understanding of why they matter, why they get missed, and how to get them right.

Creating Infrastructure Without Use Cases

Despite excitement around “Big Data,” or proclamations about the petroleum-like value potential for data, data without insights is useless. It is just more data. Likewise, engineering for its own sake is wasted effort. Filling a lake matters only if there are people who need the water.

Insights come from contextualizing data within the confines of a valuable use case. Consider the increasing risk of storing personally identifiable information. As the societal costs of data breaches become clearer, the monetary damages being levied against firms and the compliance requirements to safeguard against such failures are increasing. However, if the intended use of the data requires access to personally identifiable information like addresses or email addresses, then the cost (and associated risk) of implementing the high-security technology needed to store that information may still be worth the benefit available from modeling.

Ultimately, technology acquisition and development should be use-case driven. The array of cloud computing technologies is dizzying [1], but fundamentally each is a tool designed for a specific purpose. In the same way that you only need a drill if you need to make a hole, you should only be investing in the software tools necessary to support the data project that you have in mind (either for the present or the near future).

The ultimate value of data engineering (like other infrastructure projects) can exceed its intended use case. Based on his World War II experiences with the Reichsautobahn in Germany, President Dwight D. Eisenhower developed a multifaceted vision for the Interstate Highway System. This ranged from freedom and safety of automobile travel, to defense applications, and from economic stimulus and job creation, to future economic benefits. However, he did not envision the specific economic and cultural impacts enabled by the Interstate Highway System, like the creation of Walmart or FedEx, or the suburbanization of major metropolitan areas. Major data engineering efforts can have similarly valuable and unintended impacts within organizations when guided by strategic use cases that warrant the initial cost of the infrastructure.

Centralizing Data without Governance

Advanced data engineering efforts, such as the construction of data lakes, are designed to centralize disparate information into a single location to accelerate the production of machine learning models by data scientists. These data may come from different organizational silos, reside in old systems that do not integrate well with new technology, or have usage limitations due to compliance or regulatory restrictions. So why do organizations centralize data into analytical repositories without robust data governance?

Consider the following questions:

What was the source system for the data in the data lake?
Who maintains/owns that system? Can I rely on it? Is it on premises or managed by a third party?
Who is responsible for updating the source data?
When does it become stale?
Who can add new and remove old and unused data elements?
Are records in your data warehouse updated in place, or are repeated records added over time?
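One way to keep the answers to these questions from living only in people's heads is to record them next to each dataset. The sketch below is illustrative only: the `SourceRecord` class, its field names, and the example values are all hypothetical, not part of any particular catalog or governance tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SourceRecord:
    """Hypothetical catalog entry answering the governance questions for one dataset."""
    dataset: str
    source_system: str       # what was the source system?
    system_owner: str        # who maintains/owns that system?
    hosted_by: str           # "on-premises" or "third party"
    data_steward: str        # who is responsible for updating the source data?
    schema_approver: str     # who can add or remove data elements?
    update_mode: str         # "in-place" updates or "append" of repeated records
    max_age: timedelta       # how old the data may get before it counts as stale
    last_refreshed: datetime

    def is_stale(self, now: datetime) -> bool:
        """True when the last refresh is older than the agreed maximum age."""
        return now - self.last_refreshed > self.max_age


# Example values are invented for illustration.
record = SourceRecord(
    dataset="customer_orders",
    source_system="order_management_db",
    system_owner="sales-operations",
    hosted_by="on-premises",
    data_steward="orders-data-team",
    schema_approver="data-governance-board",
    update_mode="append",
    max_age=timedelta(days=1),
    last_refreshed=datetime(2024, 1, 1),
)

print(record.is_stale(datetime(2024, 1, 3)))
```

Even a record this small forces someone to name an owner and a staleness threshold before the data lands in the lake.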

These can seem like silly questions because the data lake architecture does not yet exist. However, failing to consider the answers to these questions could lead to unnecessary headaches 6 to 12 months down the road. Why? Employees involved in the initial integration may have churned, taking with them critical knowledge about the relationship of the architecture to source systems. Database tables and connections to other systems may break when source systems are changed. Data tables may expand over time, as users create more and more columns. Unfortunately, many of these new columns may be sparse (i.e., a large percentage of rows in the new columns are empty). Architects will get angry emails from users. Under pressure to deliver on their initiatives, middle managers will avoid the architecture at all costs. As frustrations pile up, eventually that pile reaches the C-suite, and executives will question the return they are getting from an expensive investment that is not delivering the radical data-driven transformation they hoped would lead to better bottom-line results.
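Column sparsity is one of the few symptoms in this list that can be measured directly. The following is a minimal sketch, assuming a table is available as a list of dictionaries and that `None`, an empty string, or a missing key all count as "empty". The function names and the 50% threshold are illustrative choices, not a standard.

```python
def column_sparsity(rows):
    """Return, for each column, the fraction of rows where it is empty."""
    if not rows:
        return {}
    columns = set()
    for row in rows:
        columns.update(row)
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / total
        for col in sorted(columns)
    }


def sparse_columns(rows, threshold=0.5):
    """List columns whose empty fraction is at or above the threshold."""
    return [col for col, frac in column_sparsity(rows).items() if frac >= threshold]


# Invented example rows: promo_code is empty or missing in three of four rows.
rows = [
    {"id": 1, "region": "east", "promo_code": None},
    {"id": 2, "region": "west", "promo_code": ""},
    {"id": 3, "region": "", "promo_code": "SPRING"},
    {"id": 4, "region": "north"},
]

print(column_sparsity(rows))
print(sparse_columns(rows))
```

Running a check like this on a schedule gives the governance function an early signal that users are adding columns nobody fills in.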

This painful (but common) future can be avoided with open communication, a clear process, and a rigorous data governance plan.

Beginning Construction Without Estimating ROI Potential

Organizations often undertake activities even when the ultimate value is unknown. Research and Development, for example, can be vitally important to maintaining a competitive advantage, fending off disruptive innovation, and growing a business into a previously untapped (or underserved) market. By its very nature, R&D has an unclear value proposition because the outcome is unknown. Some hypotheses prove true; many do not.

Data engineering is not R&D

The ROI for any data engineering effort should (and can) be estimated prior to construction, not sought retroactively after completion. Estimating the value of data is an open question. At Elder Research, we have found value in the framework proposed by Douglas B. Laney in Infonomics as a starting point for tying data to tangible business value. Data is an asset, but assets have varying value. The ROI from a data engineering initiative should be rooted in the value of the underlying source systems and the estimated benefit from their integration. One benchmark is the time required to access data compared to the current architecture. In some instances, the reductions can be several orders of magnitude (resulting in thousands if not millions of dollars saved).
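As a rough illustration of the access-time benchmark, the arithmetic can be written down before any construction starts. This is a back-of-the-envelope sketch, not the Infonomics framework itself. Every number below (request volume, hours per request, hourly cost, build and run costs) is a made-up assumption to be replaced with your own estimates.

```python
def annual_access_savings(requests_per_year, hours_before, hours_after, hourly_cost):
    """Value of analyst time saved per year by faster data access."""
    hours_saved = (hours_before - hours_after) * requests_per_year
    return hours_saved * hourly_cost


def simple_roi(annual_benefit, build_cost, annual_run_cost, years):
    """(total benefit - total cost) / total cost over the chosen horizon."""
    total_benefit = annual_benefit * years
    total_cost = build_cost + annual_run_cost * years
    return (total_benefit - total_cost) / total_cost


# Hypothetical inputs: 2,000 data requests a year, each dropping from
# 4 hours to 30 minutes of analyst time, at a loaded cost of $80/hour.
savings = annual_access_savings(
    requests_per_year=2000, hours_before=4.0, hours_after=0.5, hourly_cost=80
)

# Hypothetical costs: $600k to build, $100k a year to run, 3-year horizon.
roi = simple_roi(savings, build_cost=600_000, annual_run_cost=100_000, years=3)

print(f"Annual savings: ${savings:,.0f}")
print(f"Three-year ROI: {roi:.0%}")
```

The estimate is only as good as its assumptions, but writing them down gives stakeholders something concrete to challenge before the money is spent.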

Constructing Interstate highways had unexpected benefits (and costs) for the economy, yet planners, designers, and economists routinely estimate the benefit of such projects in advance, given reasonable assumptions. Data Engineers and Data Scientists can work together to make similarly reasonable estimates on the return of data engineering efforts prior to construction.

Designing an Architecture Without an Audience

Finally, a data architecture must have an audience in mind. Different audiences will have different expectations for what they will find. It’s just like with buildings. The needs and expectations of people visiting a hospital differ from those attending a school, or a movie theater, or a church. However, all are engineered structures that have to serve large numbers of people. Miss on the audience and you get a building that at best is a little strange; at worst, it undermines the purpose for which it was created.

Like buildings, data infrastructure is costly. But when thoughtfully designed, it will enhance the work of others. The end-state required from the architecture must match the requirements of a primary audience. A pipeline to support BI Analysts may be very different from one designed to support experienced Data Scientists. Data Scientists are perfectly comfortable writing Python scripts to access data via an API endpoint (and probably prefer this!). BI Analysts likely prefer tables, spreadsheets, or Tableau dashboards. As with many technologies, the simpler the end use, the more complex the underlying engineering. All of this is driven by the requirements of the intended user.
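To make the audience difference concrete, here is a small sketch that delivers the same records in two shapes: a JSON payload of the kind a Data Scientist might pull from an API endpoint, and a flat CSV table a BI Analyst could open in a spreadsheet. The records and function names are invented for illustration. A real pipeline would sit behind an actual endpoint or reporting tool.

```python
import csv
import io
import json

# Invented example data.
RECORDS = [
    {"region": "east", "orders": 120},
    {"region": "west", "orders": 95},
]


def as_json(records):
    """Serialize records as JSON, as an API response body might."""
    return json.dumps(records)


def as_csv(records):
    """Serialize the same records as a flat table with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(as_json(RECORDS))
print(as_csv(RECORDS))
```

The data are identical; only the delivery format changes, and that choice follows from who is on the receiving end.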

Keys to Successful Data Engineering

To successfully create value from data engineering in your organization, we recommend the following:

Know your audience
Understand their needs and expectations for data access
Know who will govern your data and the details of the governance plan
Determine and secure consensus on how you will measure success for the final architecture

Data engineering is a valuable role on its own, and it is vital for supporting the work of Data Scientists. Surveys of practicing Data Scientists have revealed that many leave jobs due to the significant (and unexpected) burden of performing data engineering tasks they were not trained for. Retaining top talent may rely on providing them with sophisticated engineering support so that they do not feel obliged to build it themselves. Justifying the expense/investment in engineering will require close support from your Data Scientists, as well as the business and operational stakeholders who own the underlying data sources. Ultimately, an integrated approach will lead all stakeholders safely down the data engineering trail.

[1] For example, Amazon Web Services offers over 500 cloud-based solutions for data and analytics use cases alone.

*Dr. William Goodrum has nearly a decade of experience in the management and delivery of projects and products that embed Data Science and numerical methods in software. At Elder Research, Dr. Goodrum leads a team of six Data Scientists who deliver custom Data Science training and create advanced analytical solutions and strategy for private sector clients around the globe. Dr. Goodrum has experience consulting across different industries, including logistics, software, and philanthropic development. Additionally, Dr. Goodrum has acted as PI on a NASA Phase II STTR program that implemented validated models of corrosion behavior for gas turbine engine rotors. Prior to Elder Research, Dr. Goodrum worked at a global engineering software firm where he supported customers in the Aerospace & Defense, manufacturing, and automotive industries. Dr. Goodrum’s PhD research estimated lifetime highway maintenance costs for the government of New South Wales, Australia.