Lead Data Engineer Career Guide

In this story, I would like to talk about the essential skills and knowledge I wish I had fifteen years ago to succeed in the Data Engineering space. A leadership role in engineering requires technical focus and hands-on involvement to guide the dev team towards the optimal outcome set as a project goal. There is a lot to discuss: architecture and technical standards, communicating effectively with major stakeholders, and ensuring that projects are delivered to a high degree of technical quality. Ideally, I would like to rewind my career fifteen years and see what I would need to become a successful data engineering lead. Throughout my almost fifteen-year career in analytics and tech, I have seen many things, and this story is a summary of the lessons I have learned.
What is a Lead Data Engineer?
It is true that there aren't many leadership resources available for Lead Data Engineers; most tech leadership references are for Data Analytics Managers overseeing a team of analysts to ensure that projects are delivered on time and within budget. Data engineering is an exciting and very rewarding field, and companies will always hire someone who knows how to process (ETL) data efficiently.
It definitely won't be boring and it pays well.
In one of my previous stories, I wrote about the role without the "Lead" prefix, what's included, what to expect and how to comply with requirements [1]. However, the "Lead" prefix requires more focus on leadership and soft skills development.
The role pays well because it is not an easy task to build a good and efficient data platform that provides value. It requires considerable technical expertise and solid coding skills. Consider data engineering as a combination of analytics and software engineering: the role requires the ability to merge these skill sets to build a robust data platform where all data pipelines can be orchestrated programmatically.
Adding a leadership component requires a visionary strategist, a thoughtful mentor and a skilled communicator able to work closely with everyone, from project teams to external vendors and third-party stakeholders.
Having said that, we can produce a role definition. A Lead Data Engineer is a senior-level technical lead with a software engineering background and excellent knowledge of database design, willing to take on extra leadership responsibilities. These typically include strategic decision-making, communication with external vendors, coordinating a team of developers, managing and overseeing projects, and mentoring, which is a big chunk of this set of responsibilities.
Lead Data Engineers need to possess strong people skills and critical thinking abilities.
Very often, Senior Engineers are promoted to lead roles solely based on their technical expertise. So when they get there, they lack leadership experience and need to improve their soft skills as soon as possible. I saw it quite a few times, and I experienced it as well. I was lucky enough to be given the time to learn the ways of effective communication, decision-making, self-awareness and emotional intelligence.
I tried to build trust wherever possible.
This set of skills and a willingness to listen helped me a lot in building further relationships. I would say that building trust and showing empathy are the most important, as they tend to drive a significant productivity increase in your team.
At the end of the day, it is all about the team and how successful we are working closely with each other.

In my experience, you often don't need both technical and soft skills to get promoted. However, it might depend on the company you are employed with: the more corporate it is, the higher the probability HR will assess your leadership skills. In many tech companies it is not required, and good companies provide training for that.
Lead Data Engineers are usually tasked with the following:
- Data platform design and architecture
- Data pipeline development
- Infrastructure provisioning – Yes, that's right. Good data engineers do what DevOps usually do.
- Documentation – Maintaining docs is crucial. No doubt.
- Fixing data bugs and errors – Probably the most frequent task.
- Mentorship – Working with a team of junior devs on coding standards, etc.
- Managing data projects – This would involve working closely with Project Managers to ensure projects are delivered on time, e.g. calculating estimates, etc.
- Working with developers to communicate data project requirements and specifications
- Presentations and working with the wider team to explain the intricacies of data pipelines, etc.
- Developing coding standards and managing deployments
The latter four are the tasks that differentiate a Lead Data Engineer from a Senior one, and they require more responsibility and leadership skills.
Again, in my experience, interaction and communication with major stakeholders during the project design stage is crucial and probably the most important task. Effective communication is key. This kind of feedback on business and technical requirements provides the data to support the overall technical approach for the data platform or project design.
It helps to generate accurate project estimates.
Working through these "Lead" or "Leadership" tasks might be challenging at first, but it gets easier with practice and time. I remember it took me quite a while to get comfortable with it.
Lead Data Engineer career path
This section is all about the roles that might potentially lead up to a more senior data engineering position – Lead Data Engineer or Lead Data Architect. Some companies split and differentiate these two roles and some don't, but in my experience it is often a combined role. I have witnessed Lead Data Architects who acted more like engineers and were able to write code. In my opinion, the Data Architect role is more focused on data platform design: illustrating data processing components and the relationships between them, and explaining how the data architecture will be built.
It is fundamental for a Lead Data Engineer to have both technical and business expertise. They are required to transform business requirements into technical terms, which is why they must be familiar with the product domain. If you don't fully understand what your stakeholders need, you won't be able to plan correct technical requirements or estimates.
Many years ago I started my career in compliance working for one of the central banks and now I see that it was an invaluable experience.
It is fascinating how business knowledge and experience like that can become useful in the future. It helps me a lot these days, as I keep seeing revenue reconciliation and fraud prevention data pipelines in almost every business. Compliance, PII and Financial Conduct Authority (FCA) regulation rules are everywhere. Should you find the business aspect of the Lead Data Engineer position enjoyable, you might want to think about pursuing a career in executive leadership or management. You have a lot of options if you want to work as a technical director or even chief technology officer (CTO).
The typical progression for a Lead Data Engineer would include career steps from Junior to Senior Data Engineer and this is very straightforward. As a Junior Data Engineer, we would want to know the basics, i.e. the following:
- SQL and data transformation,
- data cleansing and quality checks,
- good knowledge of Python at least,
- experience with cloud data platforms,
- understanding of data pipeline orchestration.
For instance, during a job interview [2] you might be asked about data modelling and how to update an incremental table.
Many people believe that becoming a developer requires a degree in computer science, but this is not entirely true; there are a lot of self-taught data engineers. Many people I know majored in non-technical subjects, such as psychology, philosophy and history.
You don't have to commit your entire life to one career.
You can take a fresh approach and try new things to bring your own point of view to working with data. Innovation comes from the diversity of people working together, and I really like the idea that anyone can master data engineering or data science thanks to the various certification programs and learning courses available online. For instance, I started as an intern many years ago, and I can say it was the most powerful, exciting and life-changing thing I did. It was a marketing role, but I then progressed into data science.
Data science and digital marketing have a lot in common.
Technical training is another profession that is well-suited to a move into data engineering. If you have experience in technical writing as well, it might be even more beneficial for your next career step. The key thing here is that you design the process of how students learn and the many teaching techniques that might be employed. Long story short, it's a good background for roles where mentoring is required.
Data analytics and business intelligence (BI) is another profession that opens up a career in the data engineering space. It requires proficiency in statistics, reporting and dashboard design, which is closely connected to data modelling. Learn a few coding techniques on top of that, and it can easily get you the desired position in data engineering. Diving deeper into the data is always useful, as it helps you understand the business process behind it. Working in BI, you learn how to provide insights using data, and these insights can potentially generate greater value for the business.
Data engineers cleanse and enrich data, and SQL is often the perfect tool for this. Consider the code below. It uses BigQuery SQL to update a table incrementally, merging the latest last_online timestamp per user into a target table:
-- Target table: the latest known last_online timestamp per user.
create temp table last_online as (
  select 1 as user_id
       , timestamp('2000-10-01 00:00:01') as last_online
)
;
-- Source table with raw connection events, partitioned by ingestion time.
create temp table connection_data (
  user_id int64
, timestamp timestamp
)
partition by date(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
select 2 as user_id
     , timestamp_sub(current_timestamp(), interval 28 hour) as timestamp
union all
select 1 as user_id
     , timestamp_sub(current_timestamp(), interval 28 hour) as timestamp
union all
select 1 as user_id
     , timestamp_sub(current_timestamp(), interval 20 hour) as timestamp
union all
select 1 as user_id
     , timestamp_sub(current_timestamp(), interval 1 hour) as timestamp
;
-- Upsert: take yesterday's (and newer) events, keep the max timestamp
-- per user, and merge it into last_online.
merge last_online t
using (
  select
    user_id
  , last_online
  from (
    select
      user_id
    , max(timestamp) as last_online
    from
      connection_data
    where
      date(_PARTITIONTIME) >= date_sub(current_date(), interval 1 day)
    group by
      user_id
  ) y
) s
on t.user_id = s.user_id
when matched then
  update set last_online = s.last_online, user_id = s.user_id
when not matched then
  insert (last_online, user_id) values (s.last_online, s.user_id)
;
select * from last_online
;
SQL is fundamental not only for data engineers but also for data developers, analysts, and data science and BI practitioners.
It is a universal dialect for data analysis.
Using just SQL we can run all possible data cleansing and quality checks using row conditions.

Consider the BigQuery SQL below. It checks row conditions to ensure our dataset doesn't have any quality issues:
with checks as (
  -- Collect the raw counts we want to validate.
  select
    count(transaction_id) as t_cnt
  , count(distinct transaction_id) as t_cntd
  , count(distinct (case when payment_date is null then transaction_id end)) as pmnt_date_null
  from
    production.user_transaction
)
, row_conditions as (
  -- Turn each count into a human-readable alert, or NULL if the check passes.
  select if(t_cnt = 0, 'Data for yesterday missing; ', null) as alert from checks
  union all
  select if(t_cnt != t_cntd, 'Duplicate transactions found; ', null) from checks
  union all
  select if(pmnt_date_null != 0, cast(pmnt_date_null as string) || ' NULL payment_date found', null) from checks
)
, alerts as (
  -- Aggregate the alerts that fired into a single string.
  select
    array_to_string(array_agg(alert ignore nulls), '.; ') as stringify_alert_list
  , array_length(array_agg(alert ignore nulls)) as issues_found
  from
    row_conditions
)
select
  alerts.issues_found
, if(alerts.issues_found is null
  , 'all good'
  , error(format('ATTENTION: production.user_transaction has potential data quality issues for yesterday: %t. Check dataChecks.check_user_transaction_failed_v for more info.', stringify_alert_list)))
from
  alerts
;
So when we schedule it, it will alert us if potential data quality issues are found.

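For instance, assuming the check above is saved to a file and executed with the google-cloud-bigquery client, a scheduled job (Airflow, cron or similar) only needs to run the query and let the ERROR() call fail the run. Here is a minimal sketch (file names are hypothetical):
# ./run_check.py
# Run the data quality check; if the SQL hits ERROR(), the client raises
# an exception, the scheduled run fails and the scheduler alerts us.
from google.cloud import bigquery

def run_data_check(sql_path):
    client = bigquery.Client()
    with open(sql_path) as f:
        sql = f.read()
    client.query(sql).result()  # blocks until the query finishes or fails

if __name__ == '__main__':
    run_data_check('checks/check_user_transaction.sql')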
Learning some advanced SQL techniques [3] would be very useful if we want to get into data engineering. Everything else can be derived from these basic principles, i.e. the concepts of CTEs and JOINs are similar everywhere. We can even think of Spark and Pandas data frames as CTEs.
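As a rough sketch of that idea in Pandas (the column names are made up for illustration), each intermediate data frame plays the role of a WITH clause:
# ./cte_analogy.py
# Each intermediate frame below acts like a CTE: filter first, then
# aggregate, then join, just like chained WITH clauses in SQL.
import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 1, 2],
    'ts': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-02']),
})
users = pd.DataFrame({'user_id': [1, 2], 'country': ['GB', 'DE']})

recent = events[events['ts'] >= '2024-01-02']                      # with recent as (...)
last_seen = recent.groupby('user_id', as_index=False)['ts'].max()  # with last_seen as (...)
report = last_seen.merge(users, on='user_id')                      # final select ... join

print(report)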
The Senior Data Engineer role comes with more experience. So if you are being promoted to Senior Data Engineer, I would assume you are familiar with data environments [4] and CI/CD techniques, and feel comfortable writing data transformation unit tests. Consider, as an example, a data platform split between production and staging environments.

One of the primary technical advantages of CI/CD is that it improves overall code quality and saves time.
As a Senior, you must be confident working in data environments and comfortable using the relevant deployment tools and techniques.
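As a minimal sketch of what this can look like in practice (the dataset names here are hypothetical), pipeline code can resolve its target dataset from an environment variable so that exactly the same code runs against staging and production:
# ./config.py
# Environment-aware configuration: the same pipeline code targets the
# staging or production dataset depending on a single variable.
import os

DATASETS = {
    'staging': 'staging_analytics',       # hypothetical dataset names
    'production': 'production_analytics',
}

def target_dataset():
    env = os.environ.get('ENV', 'staging')  # default to the safe environment
    return DATASETS[env]

# a pipeline then builds fully qualified table names from it:
table = f"{target_dataset()}.user_transaction"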
Unit testing should be a part of your deployment strategy. Every time we deploy a data pipeline, it has to be unit-tested [5]. Consider a data pipeline with an orchestration service.

We might want to test a few things:
- Function logic, i.e. the processEvent() function of our orchestrator. We would want to make sure that this particular function consistently returns the expected result when we provide some input for it. A minimal sketch follows this list.
- Integration tests. We also might want to write a few of these. Imagine that we need to test how our service interacts with other services, sends a request and gets a response in return.
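For the first kind of test, a minimal sketch could look like the following. The process_event() below is a made-up Python analogue of the orchestrator's processEvent(), since the real logic will differ:
# ./test_process_event.py
# A made-up process_event() stand-in and a unit test checking that the
# same input always yields the expected result.
import unittest

def process_event(event):
    # stand-in logic: normalise the event type and mark it as processed
    return {'type': event['type'].lower(), 'processed': True}

class TestProcessEvent(unittest.TestCase):
    def test_process_event_returns_expected_result(self):
        self.assertEqual(
            {'type': 'upload', 'processed': True},
            process_event({'type': 'UPLOAD'}),
        )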
Unit tests are very powerful. As a Senior, we would want to use them very often in CI/CD pipelines. I wrote about my Node.js setup for this in Test Data Pipelines the Fun and Easy Way [5].
How do we test the logic inside a function? Python's unittest makes it really simple. Take this function as an example:
# ./prime.py
import math

def is_prime(num):
    '''Check if num is prime or not.'''
    if num < 2:  # 0, 1 and negatives are not prime
        return False
    for i in range(2, int(math.sqrt(num)) + 1):
        if num % i == 0:
            return False
    return True
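The test file itself could look like this. It is a plausible minimal version with a single test method, since the original test.py isn't shown:
# ./test.py
# A plausible minimal test for prime.py: one test method, matching the
# "Ran 1 test" output below.
import unittest
from prime import is_prime

class TestPrime(unittest.TestCase):
    def test_is_prime(self):
        self.assertTrue(is_prime(7))
        self.assertFalse(is_prime(8))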
Now if we run this in our command line, we will test the logic [6]:
python -m unittest test.py
# Output:
# .
# ----------------------------------------------------------------------
# Ran 1 test in 0.000s
#
# OK
For integration tests, we can mock desired responses and outcomes too.
Let's imagine we have an ETL service that pulls data from some API, which takes a lot of time. Our service then transforms this dataset, and we would like to test that this ETL transformation logic keeps working.
Pulling data from services, databases or an API might take a long time, but we want our unit tests to run fast. We can mock a fake API response in our get_data() function and then use it to test the ETL logic in the save_data() function:
# ./test_etl.py
import unittest
import unittest.mock as mock
from asteroids import *

class TestEtl(unittest.TestCase):
    def test_asteroids_etl(self):
        # Replace the slow API call with a fake response:
        with mock.patch('asteroids.get_data') as GetDataMock:
            GetDataMock.return_value = ['asteroid_1', 'asteroid_2']
            # save_data() is expected to transform the raw records into ['1', '2']:
            self.assertEqual(['1', '2'], save_data())
The output will show a failure, because in this example save_data() returned the mocked records untransformed, which is exactly the kind of regression this test is designed to catch:
AssertionError: Lists differ: ['1', '2'] != ['asteroid_1', 'asteroid_2']
First differing element 0:
'1'
'asteroid_1'
- ['1', '2']
+ ['asteroid_1', 'asteroid_2']
----------------------------------------------------------------------
Ran 1 test in 0.001s
FAILED (failures=1)
So data environments [4], CI/CD techniques and writing data transformation unit tests are crucial skills for Senior Data Engineers. A typical Senior Data Engineer would feel very confident working with all of these, but with seniority comes greater responsibility. Usually, that means mentoring.
We wouldn't have any Senior or Lead Developers if we didn't teach and guide Junior Developers!
When I'm recruiting Senior Developers, I'm seeking individuals who are willing to collaborate and mentor junior members.
Conclusion
In this story, I tried to dive into the technical and soft skills required for successful progression into the Lead Data Engineer role. Looking back on my career path, I wanted to cover the most common expectations and tasks that come with the role and to create a comprehensive guide on how to succeed as a Lead Data Engineer or Data Architect. As a Junior Data Engineer, you would probably want to learn the basics of data modelling and ETL techniques. In the first couple of years you will be learning a lot, and troubleshooting errors and data quality checks will be a major part of your day-to-day responsibilities. Getting some relevant experience in data environment setup, coding and microservices, CI/CD techniques and unit testing data pipelines should help you get promoted to a more senior data role. To be fair, this is a difficult path, but it pays off. This knowledge is useful not only in data engineering but also in any other data-related role, i.e. BI developer, Data Analyst or Data Scientist.
Employers seek candidates with a strong blend of business, soft and technical skills when considering candidates for Lead Data positions. At some point, you might want to acquire some leadership skills: strategic vision, mentorship and successful communication with external vendors and third-party stakeholders. It is fundamental for a Lead Data Engineer to have both technical and business expertise. Strategy comes with experience, and being able to make strategic decisions fast is crucial in Lead Data roles. Effective communication, empathy and the ability to listen and take feedback will help you build trust and relationships. In my opinion, this is probably the most important leadership component.
I hope you find this story useful. Thanks for reading.
Recommended read
[1] https://medium.com/towards-data-science/how-to-become-a-data-engineer-c0319cb226c2
[2] https://towardsdatascience.com/data-engineering-interview-questions-fdef62e46505
[3] https://towardsdatascience.com/advanced-sql-techniques-for-beginners-211851a28488
[4] https://towardsdatascience.com/continuous-integration-and-deployment-for-data-platforms-817bf1b6bed1
[5] https://towardsdatascience.com/test-data-pipelines-the-fun-and-easy-way-d0f974a93a59
[6] https://towardsdatascience.com/python-for-data-engineers-f3d5db59b6dd