Data Engineering Interview Questions

This story aims to shed some light on various data engineering interview scenarios and typical discussions. It covers almost every question you might be asked, and I hope it will be useful for beginner and intermediate-level data practitioners preparing for a job interview. Throughout my almost fifteen-year career in analytics and data engineering, I have interviewed many people, and now I would like to share my observations with you.
Are data engineering interviews tough? No, not really, if you understand what you are dealing with. Many companies have tech blogs where they describe their stacks and the tech they use, so I would recommend doing some research beforehand.
Data engineering interviews are quite simple per se and the job is very rewarding.
The interviews are indeed quite simple, as the questions typically follow the same pattern. The number of data platform types [1] is limited to just four, and knowing which one you are dealing with largely defines the answer that will help you pass. So if we know what we are engineering, answering interview questions correctly is not a very big task.
Data engineering (DE) interviews are easy to pass unless you are tasked with coding. That is a whole different story and usually makes up the second part of the interview process. Below is my collection of DE interview questions and answers. Enjoy!
What is your DE work like on a day-to-day basis?
Usually hiring managers start the conversation with this simple question. Here we would want to demonstrate enthusiasm and experience with various DE tools and frameworks. Provide some data pipeline examples to decorate your answer. It can be a couple of data pipelines you built or a full life-cycle project with a data warehouse at the centre of the infrastructure. Don't call it a tutorial. It is always better to say something like…
"… a full-lifecycle project from requirements gathering to data pipeline design and go live."
It looks more professional and this is the impression you would want to create. Try to be concise but also fluent in describing your day-to-day work. For example, you could say that you are still a student, that your main focus at the moment is data quality, and that you designed and built data pipelines that check data using row conditions before loading it into the data platform. Alternatively, you could mention that you know how to work with SDKs to load data into the data warehouse, etc. You can find some good examples in this article [2]:
It is not very difficult. You can say that you have various data sources on the left-hand side and you can create data pipelines following this pattern below to integrate them into your data warehouse (DWH) solution.

Consider this example of data loading into a BigQuery data warehouse using the Pandas and google.cloud libraries [3]:
import json
import pandas
from google.cloud import bigquery
from google.oauth2 import service_account
...
# Authenticate BigQuery client:
service_account_str = config.get('BigQuery')  # Use config loaded elsewhere in the application
credentials = service_account.Credentials.from_service_account_info(service_account_str)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
...
def load_table_from_dataframe(table_schema, table_name, dataset_id):
    # ! Source data file format must be outer array JSON:
    """
    [
        {"id":"1"},
        {"id":"2"}
    ]
    """
    blob = """
    [
        {"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]},
        {"id":"2","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
    ]
    """
    body = json.loads(blob)
    print(pandas.__version__)

    table_id = client.dataset(dataset_id).table(table_name)
    job_config = bigquery.LoadJobConfig()
    schema = create_schema_from_yaml(table_schema)  # helper that builds a BigQuery schema from YAML
    job_config.schema = schema

    df = pandas.DataFrame(
        body,
        # In the loaded table, the column order reflects the order of the
        # columns in the DataFrame.
        columns=["id", "first_name", "last_name", "dob", "addresses"],
    )
    # Cast the nested addresses array to a string before loading:
    df['addresses'] = df.addresses.astype(str)
    df = df[['id', 'first_name', 'last_name', 'dob', 'addresses']]
    print(df)

    load_job = client.load_table_from_dataframe(
        df,
        table_id,
        job_config=job_config,
    )
    load_job.result()
    print("Job finished.")
How do you create data pipelines?
You would want to make clear that you are confident working with both third-party ETL tools (Fivetran, Stitch, etc.) and bespoke data connectors you can write yourself. A data pipeline is something that extracts, transforms and/or loads data from point A into the destination at point B [4].
So all you need is to demonstrate that you know how to do it following three main data pipeline design patterns – batch (aggregate and process in chunks), streaming (process and load record by record), change data capture (CDC, identify and capture changes at point A to process and load into B).
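To make the batch pattern concrete, here is a minimal sketch of an extract-transform-load script. The file names, the user_name column and the masking logic are hypothetical; the point is that data is extracted, transformed and loaded chunk by chunk rather than all at once (header handling is omitted for brevity):

import csv

def extract(path, chunk_size=1000):
    """Read the source file and yield rows in chunks (batch pattern)."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def transform(rows):
    """Apply a simple transformation, e.g. mask user names."""
    for row in rows:
        row['user_name'] = '****'
    return rows

def load(rows, destination):
    """Append the transformed rows to the destination file."""
    with open(destination, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writerows(rows)

# Run the pipeline batch by batch:
for batch in extract('source_data.csv'):
    load(transform(batch), 'transformed_data.csv')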
CDC and streaming are closely connected.
For example, we can use the MySQL binary log (binlog) to move data into our DWH solution in real time. It must be used with care and is not always the most cost-effective option for data pipelines, but it is worth mentioning. Keep your answer structured around the conceptual design diagram below – it helps to explain many ETL concepts.

What do you know about data platform design?
In a nutshell, there are four data platform architecture types that would define the selection of tools you might want to use while building a pipeline. This is the key to this question – it helps to choose the right DE tools and techniques. Data lakes, warehouses and lakehouses each have their benefits and serve their own purpose. The fourth architecture type is Data Mesh, where data management is decentralised. Data Mesh describes a state where different data domains (company departments) have their own teams and shared data resources. It might seem a bit more chaotic, but many companies choose this model to reduce data bureaucracy.
Typically, data warehouses offer better data governance compared to data lakes and make the data stack look modern and flexible thanks to built-in ANSI-SQL capabilities. The choice between a lake and a data warehouse would depend primarily on the skill set of your users. A data warehouse solution enables more interactivity and narrows our choice down to a SQL-first product (Snowflake, BigQuery, etc.).
Data lakes are for users with programming skills, and there we would want to go for Python-first products like Databricks, Galaxy, Dataproc or EMR.
What is data modeling?
Data modelling is an essential part of data engineering, as data is transformed using relationships between entities (tables, views, silos, data lakes). You would want to demonstrate that you understand how this process works in terms of conceptual and physical design. We always start with a conceptual model for our business process or data transformation task. It is followed by a functional model – a prototype that aims to prove the conceptual model works for the task. In the end, we create a physical model which describes the final infrastructure, including all required physical entities and objects. It is good to say that these do not always have to be SQL entities; conceptual data modelling can cover all types of data platforms, including semi-structured data files in cloud storage. A good example would be a scenario where we need to prepare data in the data lake first and then use it to train a machine learning (ML) model. I previously wrote about it in this story:
Orchestrate Machine Learning Pipelines with AWS Step Functions
It's always good to mention that you are familiar with templating engines such as DBT and Dataform that can be used for this task. Why? It helps a lot with data transformation unit tests [4] and data environments [5], prevents human errors and provides better deployment workflows. I previously wrote about it here:
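For instance, a hedged sketch of what a dbt model might look like (the model and source names here are hypothetical): a SQL file with Jinja templating that dbt compiles and runs in your warehouse, which makes unit testing and separate dev/staging/production environments much easier:

-- models/user_transactions_daily.sql (hypothetical dbt model)
{{ config(materialized='incremental', unique_key='transaction_date') }}

select
    date(created_at) as transaction_date
    , count(*) as transactions
    , sum(total_cost) as revenue
from {{ ref('stg_user_transactions') }}
{% if is_incremental() %}
-- process only new data on incremental runs:
where date(created_at) > (select max(transaction_date) from {{ this }})
{% endif %}
group by 1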
What is the difference between Star and Snowflake schema?
Very often job interviewers test your knowledge of data warehouse design schemas. Try to be concise and say that a Star schema is a design where large denormalised dimension tables are connected to one central fact table. The layout looks like a star, hence the name. It is more suitable for OLAP-style analytics pipelines in a data warehouse. Data in those denormalised tables is not always up-to-date, but that's fine because we need it materialised this way and we can update the required fields if needed.
In contrast to a Star schema, a Snowflake schema has the same fact table in the centre, but it is linked to dimension tables that are further normalised into sub-dimensions. This design is more suitable for OLTP data processing, where data needs to be always up-to-date and individual rows can be pulled quickly for use in the application.
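A small sketch can make the difference clear (the dataset, table and column names are hypothetical, using BigQuery-style types): in a Star schema the fact table references denormalised dimensions directly, whereas in a Snowflake schema dim_product would be split further, e.g. into a separate product category table:

-- Star schema: one fact table referencing denormalised dimensions
create table analytics.dim_product (
    product_id    int64
  , product_name  string
  , category_name string   -- denormalised: category is kept inline
);

create table analytics.dim_user (
    user_id   int64
  , user_name string
  , country   string
);

create table analytics.fact_purchases (
    transaction_id int64
  , product_id     int64   -- references dim_product
  , user_id        int64   -- references dim_user
  , total_cost     numeric
  , created_at     timestamp
);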
On a scale from 1 to 10 how good are your SQL skills?
Make sure you can explain your answer. SQL is a natural dialect to model data transformation and create analytics datasets. Working confidently with incremental table updates gives you 6 out of 10 straight away. Consider this example below. It creates an incremental table using MERGE:
create temp table last_online as (
select 1 as user_id
, timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data (
user_id int64
,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
select 2 as user_id
, timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;
merge last_online t
using (
select
user_id
, last_online
from
(
select
user_id
, max(timestamp) as last_online
from
connection_data
where
date(_partitiontime) >= date_sub(current_date(), interval 1 day)
group by
user_id
) y
) s
on t.user_id = s.user_id
when matched then
update set last_online = s.last_online, user_id = s.user_id
when not matched then
insert (last_online, user_id) values (last_online, user_id)
;
select * from last_online
;
I wrote about advanced techniques before. I think it's a good place to start the preparation [6]:
Running SQL unit tests for data transformation scripts and working with custom user-defined functions (UDF) [7] would grant you 9 out of 10.
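To show the unit test idea, you could compare the output of a transformation against a small set of expected rows and raise an error when they diverge. A minimal BigQuery-style sketch, reusing the connection_data table from the MERGE example above:

-- Unit test: the transformation must return exactly the expected rows
with expected as (
    select 1 as user_id, 3 as connections
    union all
    select 2 as user_id, 1 as connections
)
, actual as (
    select user_id, count(*) as connections
    from connection_data
    group by user_id
)
, diff as (
    (select * from expected except distinct select * from actual)
    union all
    (select * from actual except distinct select * from expected)
)
select
    if(count(*) = 0, 'test passed',
       error('Unit test failed: actual output does not match expected')) as result
from diff
;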
How do I get 10 out of 10 in SQL?
It would be something very tricky and most likely related to expert knowledge of a particular tool, e.g. converting a table into an array of structs and passing it to a UDF.
This is useful when you need to apply a user-defined function (UDF) with some complex logic to each row of a table.
You can always treat your table as an array of STRUCT-typed objects and then pass each one of them to a UDF. It depends on your logic. For example, I use it in purchase stacking to calculate expiration times:
select
    target_id
    , product_id
    , product_type_id
    , production.purchase_summary_udf(
        ARRAY_AGG(
            STRUCT(
                target_id
                , user_id
                , product_type_id
                , product_id
                , item_count
                , days
                , expire_time_after_purchase
                , transaction_id
                , purchase_created_at
                , updated_at
            )
            order by purchase_created_at
        )
    ) AS processed
from new_batch
-- aggregate purchases per product:
group by
    target_id
    , product_id
    , product_type_id
;
What is the difference between OLAP and OLTP?
Online analytical processing (OLAP) and online transactional processing (OLTP) are data processing systems designed for completely different purposes. OLAP aims to aggregate and store data for analytical purposes such as reporting and large-scale data processing. That's why denormalised, super-big tables are seen very often here. OLTP is different in the way we process data – it has a single-transaction focus and requires lightning-fast data processing. Good examples are in-app purchases, managing user accounts and updating store content. Data for OLTP is stored in indexed tables connected using the Snowflake pattern, where dimension tables are mostly normalised.
What data engineering frameworks do you know?
We can't know everything. I have interviewed a lot of people, and it's not necessary to have experience with every data engineering tool and framework. You can name a few: Python ETL (PETL), Bonobo, Apache Airflow, Bubbles, Kestra, Luigi. I previously wrote about the ETL framework explosion we have witnessed over the past couple of years.
We don't need to be super experienced with all frameworks but demonstrating confidence is a must.
In order to demonstrate confidence with various data tools we would want to learn at least one or two and then use the basic principles (data engineering principles). Using this approach we can answer almost every DE question:
Why did you do it this way? – I got this from basic principles.
Having said this, it would be just fine to learn a few things about Apache Airflow and demonstrate them with a simple pipeline example. For example, we can run ml_engine_training_op after we export data into cloud storage (bq_export_op) and make this workflow run daily or weekly.

Consider this example below.
It creates a simple data pipeline graph to export data into a cloud storage bucket and then trains the ML model using MLEngineTrainingOperator.
"""DAG definition for recommendation_bespoke model training."""
import airflow
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.hooks.base_hook import BaseHook
from airflow.operators.app_engine_admin_plugin import AppEngineVersionOperator
from airflow.operators.ml_engine_plugin import MLEngineTrainingOperator
import datetime
def _get_project_id():
"""Get project ID from default GCP connection."""
extras = BaseHook.get_connection('google_cloud_default').extra_dejson
key = 'extra__google_cloud_platform__project'
if key in extras:
project_id = extras[key]
else:
raise ('Must configure project_id in google_cloud_default '
'connection from Airflow Console')
return project_id
PROJECT_ID = _get_project_id()
# Data set constants, used in BigQuery tasks. You can change these
# to conform to your data.
DATASET = 'staging' #'analytics'
TABLE_NAME = 'recommendation_bespoke'
# GCS bucket names and region, can also be changed.
BUCKET = 'gs://rec_wals_eu'
REGION = 'us-central1' #'europe-west2' #'us-east1'
JOB_DIR = BUCKET + '/jobs'
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': airflow.utils.dates.days_ago(2),
'email': ['[email protected]'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 5,
'retry_delay': datetime.timedelta(minutes=5)
}
# Default schedule interval using cronjob syntax - can be customized here
# or in the Airflow console.
schedule_interval = '00 21 * * *'
dag = DAG('recommendations_training_v6', default_args=default_args,
schedule_interval=schedule_interval)
dag.doc_md = __doc__
#
#
# Task Definition
#
#
# BigQuery training data export to GCS
training_file = BUCKET + '/data/recommendations_small.csv' # just a few records for staging
t1 = BigQueryToCloudStorageOperator(
task_id='bq_export_op',
source_project_dataset_table='%s.recommendation_bespoke' % DATASET,
destination_cloud_storage_uris=[training_file],
export_format='CSV',
dag=dag
)
# ML Engine training job
training_file = BUCKET + '/data/recommendations_small.csv'
job_id = 'recserve_{0}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M'))
job_dir = BUCKET + '/jobs/' + job_id
output_dir = BUCKET
delimiter=','
data_type='user_groups'
master_image_uri='gcr.io/my-project/recommendation_bespoke_container:tf_rec_latest'
training_args = ['--job-dir', job_dir,
'--train-file', training_file,
'--output-dir', output_dir,
'--data-type', data_type]
master_config = {"imageUri": master_image_uri,}
t3 = MLEngineTrainingOperator(
task_id='ml_engine_training_op',
project_id=PROJECT_ID,
job_id=job_id,
training_args=training_args,
region=REGION,
scale_tier='CUSTOM',
master_type='complex_model_m_gpu',
master_config=master_config,
dag=dag
)
t3.set_upstream(t1)
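Note that t3.set_upstream(t1) can also be expressed with Airflow's bit-shift syntax, which reads more naturally in newer DAG code:

# Equivalent to t3.set_upstream(t1):
t1 >> t3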
What would you use to orchestrate your data pipelines?
It is important to differentiate between the ETL frameworks we can use for data transformation and the frameworks we use to orchestrate our data pipelines. You can mention a few: Airflow, Prefect, Dagster, Kestra, Argo, Luigi. These are the most popular ones at the moment, and they are all free, open-source projects. However, a good answer should indicate that you are also capable of orchestrating data pipelines using your own bespoke tools. If you like AWS, you can deploy and orchestrate data pipelines using CloudFormation (Infrastructure as Code) and Step Functions. I previously wrote about it here [9]:
In fact, we don't even need Step Functions here, as that would be a very platform-specific choice. We could use platform-agnostic Terraform (Infrastructure as Code) and Serverless to deploy microservices with the required data pipeline orchestration logic.
What is your programming language?
The answer to this question depends on the company's stack. Long story short, you can't go wrong if you answer Python. It is the de facto standard in DE and data science because of its simplicity and the numerous libraries and open-source data tools available in the market.
However, don't limit yourself to Python.
It is always good to be familiar with other languages, e.g. Java, JavaScript, Scala and R. R, for example, is good for data science and is very popular in academia. It is always good to mention Spark too. It is a framework rather than a language, but it became very popular due to its great scalability and large-scale data processing capabilities [8].
You might not know Spark, but if you know Python then you can always use the Spark Python API (PySpark).
What are *args and **kwargs?
This would typically be the next question if you named Python in the previous one. A question about function arguments is the most common one I ask during job interviews. You would want to be ready to answer it and maybe even impress your interviewer with a few lines of code:
def sum_example(*args):
    result = 0
    for x in args:
        result += x
    return result

print(sum_example(1, 2, 3))

def concat(**kwargs):
    result = ""
    for arg in kwargs.values():
        result += arg
    return result

print(concat(a="Data", b="Engineering", c="is", d="Great", e="!"))
How good are you with CLI tools and shell scripting?
Cloud vendors' command-line tools are based on REST APIs and provide data engineers with a powerful command-line interface to communicate with cloud service endpoints to describe and modify resources. Data engineers use CLI tools with bash scripting to chain commands, which helps to create powerful scripts and interact with cloud services with ease. Consider the example below. It invokes an AWS Lambda function called pipeline-manager:
aws lambda invoke \
    --function-name pipeline-manager \
    --payload '{ "key": "something" }' \
    response.json
We can create something even more powerful to deploy our serverless microservices. Consider the example below. It checks whether the storage bucket for the Lambda package exists, uploads the package and deploys our ETL service as a Lambda function [10]:
# ./deploy.sh
# Run: ./deploy.sh your-lambda-packages.aws
LAMBDA_BUCKET=$1        # your-lambda-packages.aws
PROFILE=${2:-default}   # optional second argument: AWS CLI profile name
STACK_NAME=SimpleETLService
APP_FOLDER=pipeline_manager

# Get date and time to create a unique s3-key for the deployment package:
date
TIME=`date +"%Y%m%d%H%M%S"`

# Get the name of the base application folder, i.e. pipeline_manager.
base=${PWD##*/}

# Use this name to name the zip:
zp=$base".zip"
echo $zp

# Remove the old package if it exists:
rm -f $zp

# Package the Lambda:
zip -r $zp "./${APP_FOLDER}" -x deploy.sh

# Check if the Lambda bucket exists:
LAMBDA_BUCKET_EXISTS=$(aws s3 ls ${LAMBDA_BUCKET} --output text)
# If NOT:
if [[ $? -eq 254 ]]; then
    # create a bucket to keep the packaged Lambda files:
    echo "Creating Lambda code bucket ${LAMBDA_BUCKET} "
    CREATE_BUCKET=$(aws s3 mb s3://${LAMBDA_BUCKET} --output text)
    echo ${CREATE_BUCKET}
fi

# Upload the package to S3:
aws s3 cp ./${base}.zip s3://${LAMBDA_BUCKET}/${APP_FOLDER}/${base}${TIME}.zip

# Deploy / Update:
aws --profile $PROFILE \
    cloudformation deploy \
    --template-file stack.yaml \
    --stack-name $STACK_NAME \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides \
        "StackPackageS3Key"="${APP_FOLDER}/${base}${TIME}.zip" \
        "AppFolder"=$APP_FOLDER \
        "LambdaCodeLocation"=$LAMBDA_BUCKET \
        "Environment"="staging" \
        "Testing"="false"
How do you deploy your data pipelines?
There are no right or wrong answers, but if you say "I manually create pipeline steps and then deploy them in the cloud using the vendor's console…", that wouldn't be the best answer. A good answer would mention scripts. This tells the interviewer that you are at least an intermediate user familiar with shell scripting. You would want to say that whatever you deploy can be deployed using bash scripts and CLI tools. All major cloud vendors have their command-line tools and you would want to be familiar with at least one of them. The optimal way, often considered best practice, is to deploy your pipelines using Infrastructure as Code and CI/CD tools [11].
How good are you with Data Science?
As a data engineer you don't need to know all the intricacies of data science model training and hyperparameter tuning, but remember that a good data scientist must be a good data engineer. It doesn't have to work the other way around, but it is always good to demonstrate at least some knowledge of basic data science algorithms. For example, you can mention that you know how to create linear and logistic regression models. One produces a quantitative output (a predicted number) while the other returns a simple answer – "yes" or "no" (1/0). In fact, all major data science models can be easily trained using SQL inside your data warehouse solution.
Let's imagine our use case is churn prediction.
Consider BigQuery ML where we can create a logistic regression like so:
CREATE OR REPLACE MODEL sample_churn_model.churn_model
OPTIONS(
MODEL_TYPE="LOGISTIC_REG",
INPUT_LABEL_COLS=["churned"]
) AS
SELECT
* except (
user_pseudo_id
,first_seen_ts
,last_seen_ts
)
FROM
sample_churn_model.churn
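Once the model is trained, it is worth showing that you can also generate predictions from it. A short sketch using ML.PREDICT, assuming the same sample_churn_model.churn table holds fresh user features:

SELECT
    user_pseudo_id
  , predicted_churned
  , predicted_churned_probs
FROM
  ML.PREDICT(
    MODEL `sample_churn_model.churn_model`,
    (SELECT * FROM sample_churn_model.churn)
  )
;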
What do you know about data quality and data reliability?
This is always a good question because you might be asked about possible ways to ensure data quality in your data platform. Improving data pipelines in terms of data accuracy is part of a data engineer's daily routine. Data engineers connect data sources and deploy pipelines where data must be extracted and then, very often, transformed according to business requirements.
We would want to make sure that all required fields exist (data quality) and no data is missing (reliability).
How do we do it? It's always good to mention self-fixing pipelines and that you know how to deploy them. Data engineers can deploy data quality pipelines in a similar way to how they deploy ETL pipelines. To put it simply, you would want to use row conditions on a dataset and, based on the outcome, trigger a fixing step, e.g. extract the missing data and load it.
Using row conditions for your datasets aims to ensure data quality.
All data quality checks can be scheduled as scripts, and if any of them fail to meet certain conditions we can send an email notification. It's worth saying that modern data warehouse solutions allow SQL scripts to perform such checks, but it doesn't have to be limited to SQL. Any data check script can be run on data in the data lake or anywhere else; it just depends on the type of our data platform. Good coding skills are a must in this case, so we would want to demonstrate that we know how to create a simple patrol application that can scan our data wherever it is located physically.
The SQL-based answer is also good, but it would be more suitable for a Data Developer role, as SQL is often considered the main data querying dialect in analytics. Consider the example below. It uses SQL row conditions to check whether there are any records with a NULL payment_date. It will also check for duplicates.
with checks as (
select
count( transaction_id ) as t_cnt
, count(distinct transaction_id) as t_cntd
, count(distinct (case when payment_date is null then transaction_id end)) as pmnt_date_null
from
production.user_transaction
)
, row_conditions as (
select if(t_cnt = 0,'Data for yesterday missing; ', NULL) as alert from checks
union all
select if(t_cnt != t_cntd,'Duplicate transactions found; ', NULL) from checks
union all
select if(pmnt_date_null != 0, cast(pmnt_date_null as string )||' NULL payment_date found', NULL) from checks
)
, alerts as (
select
array_to_string(
array_agg(alert IGNORE NULLS)
,'.; ') as stringify_alert_list
, array_length(array_agg(alert IGNORE NULLS)) as issues_found
from
row_conditions
)
select
alerts.issues_found,
if(alerts.issues_found is null, 'all good'
, ERROR(FORMAT('ATTENTION: production.user_transaction has potential data quality issues for yesterday: %t. Check dataChecks.check_user_transaction_failed_v for more info.'
, stringify_alert_list)))
from
alerts
;
As a result BigQuery will send an automated email containing the alert:

What algorithm would you use to extract or process a very large dataset?
This question might be a trap if you had previous questions about data transformation with Python. If you like Python then you are probably a big fan of the Pandas library and you have probably already mentioned it during the interview. Well, this is the kind of question where you wouldn't want to use Pandas. The thing is that Pandas doesn't handle big datasets very well, especially for data transformation. You will always be limited by your machine's memory when running data transformations on a Pandas dataframe.
The right answer would be to mention that if memory is limited then you would find a scalable solution for this task. This can be a simple Python generator and, yes, it can take a lot of time but at least it won't fail.
# Create a file first: ./very_big_file.csv as:
# transaction_id,user_id,total_cost,dt
# 1,John,10.99,2023-04-15
# 2,Mary, 4.99,2023-04-12

# example.py
def etl(item):
    # Do some ETL here
    return item.replace("John", '****')

# Create a generator
def batch_read_file(file_object, batch_size=19):
    """Lazy function (generator) that reads a file in chunks.
    The default chunk size is kept tiny (19 bytes) for this demo;
    in practice it would be something like 1024 bytes or more."""
    while True:
        data = file_object.read(batch_size)
        if not data:
            break
        yield data

# and read in chunks
with open('very_big_file.csv') as f:
    for batch in batch_read_file(f):
        print(etl(batch))

# In the command line run:
# python example.py
The optimal answer should include transforming the data using distributed computing, ideally with a tool that is fast for this purpose and scales well. Spark or Hive-based tools might be a good choice.
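If Spark comes up, a short sketch like the one below is usually enough to show that you can push the same transformation to a distributed engine instead of a single machine's memory (the bucket paths and the masking logic are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big_file_etl").getOrCreate()

# Spark reads and processes the file in distributed partitions,
# so we are not limited by a single machine's memory:
df = spark.read.csv("s3://my-data-lake/very_big_file.csv", header=True, inferSchema=True)

masked = df.withColumn("user_id", F.lit("****"))

masked.write.mode("overwrite").parquet("s3://my-data-lake/transformed/")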
When would you use Hadoop in your pipelines?
You would want to mention that Hadoop is an open-source big data processing framework developed by the Apache Software Foundation and that it brings all the benefits of distributed data processing. That's why it became so popular in data pipelines processing large volumes of data. It has its own intrinsic components that ensure reliable storage (HDFS – the Hadoop Distributed File System) and scalability (MapReduce). Even if you don't have hands-on experience with Hadoop, it should be enough to mention these things, as there are a lot of tools built on top of Apache Hadoop, e.g. Apache Pig (a platform that executes data processing jobs in MapReduce) or Apache Hive – a data warehouse project where we can use a standard SQL dialect to process data stored in databases and file systems that integrate with Hadoop.
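Even a small HiveQL sketch helps here (the HDFS path and table name are hypothetical): it exposes raw files stored in HDFS as a table and aggregates them with a standard SQL dialect, while the engine turns the query into distributed jobs:

-- Expose raw CSV files stored in HDFS as a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS raw_transactions (
    transaction_id STRING,
    user_id        STRING,
    total_cost     DOUBLE,
    dt             STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///data/raw_transactions/';

-- The query below is compiled into distributed MapReduce (or Tez/Spark) jobs:
SELECT user_id, SUM(total_cost) AS total_spent
FROM raw_transactions
GROUP BY user_id;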
How would you approach a big data migration project?
During the job interview you might be asked this question, as interviewers want to understand your experience with data migration and your approach to data validation once it is complete. Here I would recommend starting with business requirements. They might concern cost-effectiveness, data governance or overall database performance. Depending on these requirements, we can select the optimal solution as the destination of our migration project. For example, if your current data platform is built on a data lake and there are a lot of business stakeholders who want to access the data, then your choice should be among ANSI-SQL data warehouse solutions, which offer better data governance and granular access controls. Conversely, if our data warehouse solution has storage cost issues, then migrating or archiving data to a data lake might be a good option. I previously wrote about it here:
Once the migration is complete we would want to validate the data. Data consistency is a top priority for data engineers, and you would want to demonstrate that you know how to confirm that no data was lost during the migration. For instance, we could calculate the total number of records per partition in the data warehouse and then compare it against the number of records in the corresponding data lake partitions. count(*) is an inexpensive operation, it is very effective for this kind of validation and it runs fast; in fact, in many DWH solutions count(*) is free.
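For instance, a hedged sketch of the warehouse-side validation query, reusing the production.user_transaction table from the data quality example above and assuming a created_at timestamp column:

-- Per-partition record counts on the warehouse side:
select
    date(created_at) as partition_date
  , count(*) as records_in_dwh
from production.user_transaction
group by partition_date
order by partition_date
;

The same counts, grouped by the same date, can then be computed on the data lake side (e.g. with Spark) and the two result sets compared partition by partition.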
Which ETL tools do you know and how is ETL different from ELT?
When answering this question, we would want to demonstrate that we know how to extract, transform and load data not only with third-party tools but also by writing our own bespoke data connectors and loaders. You can start with a quick note that there are managed solutions like Fivetran, Stitch, etc. that help with ETL; don't forget to mention their pricing models, which are often based on the number of records processed. It is also worth stating the difference explicitly: in ETL, data is transformed before it is loaded into the destination, while in ELT, raw data is loaded into the warehouse first and transformed there, typically with SQL.
You don't need third-party ETL tools when you know how to code.
Don't be shy about saying this phrase. It is fairly easy to create your own ETL tool and then load the data into the DWH solution of your choice. Consider one of my previous articles where I extract millions of rows of data from MySQL or Postgres databases as an example. It explains how to create a robust data connector and extract data in chunks in a memory-efficient manner [12]. Things like this were designed to be serverless and can be easily deployed and scheduled in the cloud.
We can even create our own bespoke data loading manager if we need to prepare and transform data before loading it into the DWH destination using cloud SDKs. It's a fairly complex application, but it's worth learning how to build one. Here is the tutorial.
Conclusion
"What is your approach to …?" type of question is very common in Data Engineering interviews. Be ready to answer these scenario questions. During the interview, you can be asked to design a pipeline or a data platform. That's all it is if you take a look at the wider picture. Every data platform has its own business and functional requirements – it's always good to put this phrase in and then mention that you would select data tools based on these requirements. Decorate your answers with data pipeline examples and you will most definitely pass.
Recommended read
[1] https://medium.com/towards-data-science/data-platform-architecture-types-f255ac6e0b7
[2] https://medium.com/towards-data-science/python-for-data-engineers-f3d5db59b6dd
[3] https://towardsdatascience.com/modern-data-engineering-e202776fb9a9
[6] https://towardsdatascience.com/advanced-sql-techniques-for-beginners-211851a28488
[7] https://cloud.google.com/bigquery/docs/user-defined-functions
[9] https://medium.com/towards-data-science/data-pipeline-orchestration-9887e1b5eb7a
[10] https://towardsdatascience.com/how-to-become-a-data-engineer-c0319cb226c2