A portrait of ETL Error Handling

Speaking of ETL error handling, this is how ZOLA Electric tackled the problem.

Ernest Mwalusanya
Power the People

--

From the previous blog, you probably have a clear understanding of what the Data Infrastructure looks like at ZOLA Electric.

In this blog I will describe the steps and measures we used to fix ETL errors, as well as the best practices we have adopted. Before describing how things used to be and how we handled them, let me give you a complete picture of our internal ETL performance at ZOLA Electric over the past year.

ETL performance at ZOLA Electric.

How it used to be
When building a Data Infrastructure, things can easily get out of hand: bad designs or wrong assumptions about the ETL flows quickly lead to errors.

A few years ago at ZOLA Electric, the error rate in our ETL infrastructure looked very low. This did not mean everything was running smoothly; many errors simply happened without a proper way to capture them and send notifications.

The picture below shows a low error rate before we started alerting on ETL errors.

Low error rates at ZOLA Electric before adopting ETL best practices.

The Transition Period
To fix the problems described above, we had to do a number of things to make sure we had proper procedures to follow when designing and maintaining our ETL. Our transition period involved changes in the following key areas:

Tools and Tech side:
1. Removed the problematic ETL parts (the ones that had become unstable).

2. Introduced more advanced and easily maintained tools.

3. Started to frequently review pull requests (PRs) for development code.

4. Added automated checks for code quality and style (a sketch of such a check follows this list).

5. Introduced a staging environment for testing all changes before pushing them to production. (This is where we caught and fixed most errors.)
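
The post doesn't prescribe the exact tooling behind these checks, so take the following as a rough sketch only; flake8 and pytest are assumptions for the example, not necessarily what we run. A pre-deployment gate can simply run the style check and the test suite and refuse to deploy if either fails:

```python
# check_quality.py -- hypothetical pre-deployment gate; tool names are assumptions.
# Runs a style check and the test suite; a non-zero exit code blocks the deploy.
import subprocess
import sys

CHECKS = [
    ["flake8", "etl/"],          # style / lint check
    ["pytest", "tests/", "-q"],  # unit tests
]

def main() -> int:
    for cmd in CHECKS:
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("Check failed:", " ".join(cmd), "- aborting deploy")
            return result.returncode
    print("All checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```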

ETL Execution:
1. Introduced explicit job dependencies that stop the entire chain of jobs if an error occurs in one of the dependencies (see the sketch after this list).

2. Allowed jobs to retry in case of connection faults or other network-related errors.

3. Introduced periodic job runs.
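
The post doesn't name the orchestrator we use, so here is a hedged illustration only: an Apache Airflow-style DAG (the task names, schedule and retry values are made up for the example) showing how all three ideas combine, a dependency chain that stops when an upstream task fails, automatic retries for transient faults, and a periodic schedule.

```python
# Hypothetical Airflow DAG illustrating dependencies, retries and periodic runs.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from source systems

def transform():
    ...  # clean and reshape the extracted data

def load():
    ...  # write results into the warehouse

default_args = {
    "retries": 3,                           # retry transient (network) failures
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",             # periodic runs
    default_args=default_args,
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Explicit dependencies: if extract fails, transform and load never run.
    t_extract >> t_transform >> t_load
```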

Communication:
1. Produced more actionable alerts via e-mail or Slack notifications (sketched below).

2. Team members provide details on what they are fixing from reported errors, how they are fixing it, and its current status.
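
To make the idea of an "actionable" alert concrete, here is a minimal sketch (the webhook URL, job name and log link are placeholders, not our real setup): a failed job posts what failed, why, and where to look to a Slack channel.

```python
# Hypothetical failure alert sent to a Slack incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_failure(job_name: str, error: str, log_url: str) -> None:
    """Send an actionable alert: which job failed, why, and where to look."""
    message = (
        f":red_circle: ETL job *{job_name}* failed\n"
        f"Error: {error}\n"
        f"Logs: {log_url}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

# Example use inside the except block of an ETL job (names are illustrative):
# notify_failure("load_payments", str(exc), "https://logs.example.com/load_payments")
```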

Documentation:
1. Documented frequent errors, the best ways to avoid and fix them, and possible ways to automate detecting and fixing them.

2. Redesigned parts that failed too often or had design issues.

3. Documented all parts of our ETL and how to maintain them.

After establishing these steps, things changed in our ETL. The picture below shows how things looked during the transition period, when we introduced proper alerts for each event in our ETL.

High error rates at ZOLA Electric during the transition period.

The high error rate during the transition period was mostly due to previously uncaptured errors in the old ETL infrastructure. The main improvement at this stage was having proper procedures to capture and fix all of them.

Fixing the majority of the captured errors at this stage gave us a very good foundation for where we are now and where we want to be in the future.

Where we are and where we are going

After introducing standard ways of designing and maintaining ETL at ZOLA Electric, our overall ETL error rate dropped significantly.

Current situation: More stable ETL with low error rate.

Low error rates after properly capturing and handling all ETL errors.

With the current design of our infrastructure, our ETL is more stable, and whenever errors occur they now get captured, fixed and documented. If they are caused by connection or other network-related faults, they even get fixed automatically with periodic retries. Other errors, like bad syntax in code, rarely reach production thanks to the code checks we run before deployment and thorough tests on staging.
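
As an illustration of that automatic-retry behaviour (the helper, attempt counts and delays below are assumptions for the sketch, not our production code), a job step that hits a transient connection error can simply be re-run a few times before the failure is surfaced as an alert:

```python
# Hypothetical retry helper for transient connection / network errors.
import time
from functools import wraps

def retry_on_network_error(attempts: int = 3, delay_seconds: int = 60):
    """Re-run the wrapped step on ConnectionError/TimeoutError, then re-raise."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except (ConnectionError, TimeoutError):
                    if attempt == attempts:
                        raise  # give up; the error is then alerted on
                    time.sleep(delay_seconds * attempt)  # simple linear backoff
        return wrapper
    return decorator

@retry_on_network_error(attempts=3, delay_seconds=60)
def load_to_warehouse():
    ...  # network call that occasionally times out
```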

--