InterWorks is a people-focused tech consultancy specializing in business intelligence, data warehousing and data integration. To learn more, reach out to Chris Soule.
As your business grows, you’ll discover that you need reporting functionality beyond what your data sources natively provide. To answer more complicated questions, you’ll need to combine data from multiple sources. Today, it’s standard practice to centralize data from sources both inside and outside of your organization in one location, often a data warehouse. This allows you to identify patterns you never knew existed, gain better visibility into your operations and formulate new strategies. What’s not to love?
This practice, called data integration, requires a data pipeline, often referred to as an ELT (extract-load-transform) pipeline. In the past, you may have heard the acronym ETL (extract-transform-load), but the industry has discovered that transforming data before it’s stored in your data warehouse is less sustainable, less flexible and harder to automate. George Fraser, founder and CEO of Fivetran, does a great job explaining the differences between these two pipelines in this video.
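To make the ELT idea concrete, here is a minimal sketch of the pattern: land raw records untouched, then model them with SQL inside the warehouse. The source records and table names are hypothetical, and SQLite stands in for a real cloud warehouse purely so the example runs anywhere.

```python
import json
import sqlite3

# Stand-in "warehouse": SQLite plays the role a cloud data warehouse
# (Snowflake, BigQuery, etc.) would play in a real ELT pipeline.
warehouse = sqlite3.connect(":memory:")

# --- Extract: pull raw records from a source (hypothetical payload) ---
raw_records = [
    {"order_id": 1, "amount": "19.99", "region": "us-east"},
    {"order_id": 2, "amount": "5.00", "region": "us-west"},
]

# --- Load: land the data as-is, untransformed, in a raw table ---
warehouse.execute("CREATE TABLE raw_orders (payload TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?)",
    [(json.dumps(r),) for r in raw_records],
)

# --- Transform: model the raw data *inside* the warehouse with SQL ---
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT
        json_extract(payload, '$.order_id') AS order_id,
        CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
        json_extract(payload, '$.region') AS region
    FROM raw_orders
""")

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 24.99
```

The key point is the ordering: because the raw payload is preserved in the warehouse, you can rewrite or add transformations later without re-extracting anything, which is exactly the flexibility ELT has over ETL.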
Data integration can be broken down into four categories:
- Data ingestion (extract, load)
- Data transformation (data cleansing, modeling)
- Data presentation (dashboards, reports)
- Data automation (affects all the above)
Although each of these steps affects scalability, performance and cost, data ingestion is the most critical part of the entire solution because it sits upstream of everything else, and it’s also the part most prone to failure. Incoming data affects every downstream task: when ingestion breaks, stakeholders receive invalid or stale data instead of up-to-date, accurate information. Let me set the stage for how easily some organizations go astray.
Labor Costs More Than Technology
At InterWorks, we’re often called in from the outset to help guide organizations throughout their data journey. Early decisions set you up for success or failure.
A common early pitfall is choosing the wrong tools and platforms. When you choose data pipelines, data warehouses and business intelligence tools, you are evaluating not only each solution’s monetary cost and capabilities but also the hours your team will spend on development, maintenance and troubleshooting.
Usually, the most expensive cost of any solution is not the product itself but the human cost of talent and labor. This is especially true for data pipelines.
Choosing the Wrong Data Pipeline
A project from my past exemplifies how misguided early decisions can subsequently cause a client one setback after another. The client, a large organization, had always had a very small BI staff. This could have been fine, but the organization had also chosen a data pipeline that had a modest upfront and per-month cost, but was technically demanding, overly complex and hard to maintain.
This left the BI team consistently lagging on new projects, delivering solutions late and even missing maintenance schedules, which created more fires to put out in both development and production. Getting projects delivered on time (if at all) and keeping them running took herculean efforts.
The data pipeline they chose could have worked well in a different context. A larger staff, segmented appropriately with more structured project management, would have had an easier time maintaining the data pipeline. However, the reality is that few organizations can dedicate so many resources to such a complex, labor-intensive solution. Alternatively, their smaller staff could have benefited from a solution with a higher monetary cost but more labor-saving characteristics.
This same scenario plays out in many organizations, and at the root is the tension of balancing cost with value.
Individuals responsible for proposing and maintaining their organization’s BI solution have a hard job and many competing demands, namely: Do it right and don’t spend a fortune! This cost sensitivity pressures BI leaders to find the best, most feature-rich solutions, identify cheaper alternatives and then mull whether their company really needs the best or can get by with something cheaper that does less. This approach runs a serious danger of overlooking costs beyond annual or monthly payments.
I propose, instead: Do what’s best, and if you need to spend money, justify it. Don’t be penny wise, dollar foolish. This brings us to our next topic.
Spend Money to Save Labor
As we’ve previously discussed, the human cost of using and maintaining a complex, technically intensive data integration solution is a common bottleneck to successful data integration. Such a solution may cost less up front and per month/year, but will cost the company dearly in the long run, especially with a smaller staff.
I always ask: Is the internal work to maintain this solution even worth the time and effort?
Instead, what if you chose a tool that costs more but offers ease of use, automation and self-maintenance, rather than one that’s just as capable but less intuitive to work in and demands more knowledge, experience and skill to maintain?
These questions can inspire the right conversations about what tools to include in your data pipeline. Let’s walk through what happens when you choose a less intuitive, higher maintenance tool.
How High-Maintenance ETL Tools Ruin Your Workflow
At first, nothing bad happens. Every new tool has a learning curve, and your staff is happily climbing it. With the help of vendors and consultants, your staff — small and mighty as they are — has ample time to set things up just right. They knock the socks off your end users and stakeholders with what they deliver. Everyone is a hero. All is well … until after the first production release.
Then comes the flood of requests from across the business, which is a testament to how well your staff has done. The first few requests aren’t bad, and you are able to deliver some other ad hoc requests that provide even more value.
But now errors have started to crop up. The more requests you put into production, the more complex your ecosystem becomes and the higher the probability of further errors for your staff to investigate and correct. You also need to ensure all the systems in your pipeline are current, including drivers and features, while keeping security top of mind, too.
Your staff quickly becomes overwhelmed, and the request to increase staff isn’t getting approved anytime soon, even as the demand for your solution builds. Tensions and stress show from the top down. Turnover begins and with that, your team starts to lose knowledge and expertise. Welcome to the doom spiral …
It’s easy to say that every solution your team puts into production must be perfect, but that’s unrealistic. Development work is hard and messy. Today’s perfect solution is obsolescent tomorrow. Many organizations use agile BI development to stay nimble as changes arise. Like most things in business, the solution starts a lot higher up the food chain than your BI staff, and it takes bravery and determination to address it.
ETL: Is There a Better Way?
How can you stave off this outcome and make your stakeholders happy? We mentioned earlier that data ingestion is one of the most error-prone pieces of the puzzle, with high maintenance costs. Data ingestion largely determines how much time your staff has left to handle the additional requests that come in, especially as data sources change over time. You’ll need to factor in the time required to:
- Develop custom connectors for your sources
- Profile the incoming data to determine whether you can use a replication key to incrementally update or must perform full loads
- Set up a data refresh cadence based on end users’ needs
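The replication-key decision above is worth a quick illustration. Below is a minimal sketch of incremental extraction against a hypothetical source table, using an `updated_at` column as the replication key and a saved watermark to avoid full loads on every sync. All names and records here are invented for illustration.

```python
# Hypothetical source rows; "updated_at" serves as the replication key.
# ISO-8601 timestamp strings compare correctly as plain strings.
source_rows = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00", "status": "shipped"},
    {"id": 2, "updated_at": "2024-01-03T00:00:00", "status": "pending"},
    {"id": 3, "updated_at": "2024-01-05T00:00:00", "status": "pending"},
]

def extract(last_synced):
    """Incremental extract: only rows whose replication key advanced
    past the last sync. A missing watermark falls back to a full load."""
    if last_synced is None:
        return list(source_rows)
    return [r for r in source_rows if r["updated_at"] > last_synced]

# First run: no watermark yet, so a full load of all three rows.
batch = extract(None)
watermark = max(r["updated_at"] for r in batch)

# Later runs pull only new or changed rows past the watermark.
source_rows.append(
    {"id": 4, "updated_at": "2024-01-07T00:00:00", "status": "pending"}
)
delta = extract(watermark)
print([r["id"] for r in delta])  # [4]
```

Not every source offers a usable replication key, which is why profiling the data up front matters: without one, you’re stuck with full loads, and the refresh cadence you can afford changes accordingly.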
What to Look for in a Data Pipeline
Your data pipeline, used to perform data ingestion, is the first place I’d look for solutions to cut internal maintenance hours and accelerate development efforts for new value-add projects. The solution must:
- Offer data connectors to your data sources, and more generally a wide range of commonly used SaaS and operational databases
- Support your intended destination, e.g. cloud data warehouses and data lakes
- Have an excellent SLA with responsive support and low downtime
- Have a proven track record of updating connectors before they become outdated
- Be very simple and intuitive, with the most important tasks automated and little or no configuration or custom coding for common data sources
Fivetran fits this description.
Boasting over 150 different source systems and an extensive catalog of prebuilt connectors, Fivetran likely already supports the sources you want to ingest. The interface is extremely simple, allowing you to set up automated data ingestion in minutes. This frees up your staff to do more of the work they like and deliver more value to your stakeholders and end users.
Don’t just take my word for it — check out this review or that review discussing the time saved by outsourcing ingestion to Fivetran rather than spending untold hours designing, developing and maintaining custom-built ingestion pipelines.
Next, apply the same approach to other areas of data integration and ask whether it’s worth the time and effort to build and maintain an internal solution. How much will it cost you? (Not just the upfront cost but also the impact and cost of human hours.) Do you have the FTEs to not only maintain the ecosystem but also drive growth? These considerations will have a wide-reaching impact across your organization for years to come.
Here at InterWorks, we help organizations through this journey by being objective and constantly evaluating the best solutions in the space. If you would like to speak with us about how we can assist you with your BI ecosystem, whether rebooting your existing solution or just starting down the path, we’d love to come alongside you as a trusted advisor.