As I happened to be hiring data engineers in 2025, I wanted to share my thoughts on what candidates need to know for an entry-level or mid-level role.
How should you be using this outline?
I have met many engineers and students who think that AWS or any platform-specific certification will help them secure jobs.
However, I have never hired an engineer based on AWS certification, though my team uses the AWS tech stack daily.
AWS certification is excellent if you're unfamiliar with their tools and services. However, it is not a testament to a data engineer's skill.
An experienced data engineer knows data engineering best practices and how to adapt to different business requirements. Those skills are platform agnostic, regardless of the cloud services or tools you use.
Certification in services and tools is a bonus and never a core requirement.
If you want to set a path to upskill, then the information below is for you. It is not exhaustive and may vary between companies, but it covers most job requirements in the industry.
Contents Overview
A programming language that is commonly used for manipulating data.
Batch-based data pipelines
Streaming data pipelines (Can be optional)
Big data tools and concepts (Can be optional)
Data modeling
Managing relational and non-relational databases
Continuous Integration / Continuous Deployment (CI/CD) frameworks
How do we ensure data integrity, quality, and timeliness from source to destination?
Programming language for manipulating data
Any language can be used to manipulate data, but Python is the current industry standard.
Another noteworthy language is R. However, Python takes the cake for its maturity, community support, versatility with Jupyter notebooks, and track record in production services.
R is excellent for quick prototypes and statistical analysis, but I have not seen data pipelines or machine learning (ML) models deployed as production services in R.
Batch-Based Pipelines Overview
A data pipeline is a series of processes and logic that moves data from a source to a destination.
Batch-based pipelines transfer data in chunks on a set schedule, and they are one of the easiest pipeline types to learn because:
They usually serve business objectives that are not time-critical.
You control every aspect of the pipeline, including load and timing; thus, given your resources, you can always experiment and optimize.
Batch-based pipelines are prevalent in every company.
Batch-based pipelines are foundational knowledge for any aspiring data engineer. At the very least, you need to understand:
Extract-Transform-Load (ETL) design
Extract-Load-Transformation (ELT) design
There are other types of pipelines, such as machine learning (ML) pipelines, but if you're aspiring to be a quintessential data engineer, the above is good enough.
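To make the distinction concrete, here is a minimal sketch of an ETL-style batch job in Python with pandas. The file name, column names, and SQLite destination are hypothetical placeholders, not a prescribed setup.

```python
import sqlite3

import pandas as pd

# Extract: read a raw CSV export (hypothetical file and column names).
raw = pd.read_csv("orders_2025-01-01.csv")

# Transform: clean the records and aggregate revenue per day.
clean = raw.dropna(subset=["order_id"]).copy()
clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.date
daily_revenue = (
    clean.groupby("order_date", as_index=False)["amount"]
         .sum()
         .rename(columns={"amount": "revenue"})
)

# Load: write the transformed result into the destination database.
with sqlite3.connect("warehouse.db") as conn:
    daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```

In an ELT design, the raw data would be loaded into the warehouse first, and the transformation would run there, typically as SQL.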
Orchestration Tools
Many companies use orchestration tools such as Apache Airflow or Dagster as their default platform.
Orchestration tools help a data engineer ensure that every part of the process is well-defined and as decoupled as possible.
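For illustration, here is a minimal sketch of a batch DAG in Apache Airflow (assuming a 2.4+ release, where the schedule argument exists). The DAG name and the three placeholder callables are hypothetical; the point is that each step is small, decoupled, and individually testable.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull raw data from the source system (placeholder logic).
    ...


def transform():
    # Apply business logic to the extracted data (placeholder logic).
    ...


def load():
    # Write the transformed data to the destination (placeholder logic).
    ...


with DAG(
    dag_id="daily_orders_batch",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Keep the graph shallow: three small, independently testable steps.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```

Because extract, transform, and load are plain Python functions, each one can be unit-tested without spinning up the scheduler.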
Pro Tip
You don't have to master every orchestration tool to be good at batch-based processes. Many concepts, such as writing Directed Acyclic Graphs (DAGs), are transferable.
Avoid building deep DAGs. The deeper the DAG, the longer it takes to recover the service and run an end-to-end validation when an issue occurs.
It can be very frustrating to deal with a business-critical issue and wait a long time to validate because the DAG is deep.
Ensure that every step that you create within a DAG is testable on its own.
When a DAG step cannot be isolated for testing, it becomes difficult to troubleshoot when a bug arises.
Always create repeatable patterns with your DAGs. It saves plenty of time when you build new pipelines.
Streaming Data Pipelines Overview
When data freshness is critical to the business outcome, you'll need near real-time data capabilities to respond effectively.
For example, you'll need real-time streaming to detect fraudulent credit card charges, because if bad actors are given time, they will cause more financial damage.
Streaming data pipelines usually come in two transmission forms:
Synchronous
Asynchronous
When I worked at CBS Sports on real-time NFL predictions, the plays we ingested had to reflect the actual live game activity. Thus, the transmission had to be synchronous.
Generally, asynchronous transmission is preferred because it is a non-blocking operation. Since there is no need to capture the data in order, you can scale the pipelines easily.
Data Streaming Tools
Apache Kafka, AWS Kinesis, and RabbitMQ are some of the tools companies commonly use.
It may be easy to set up a demo data streaming pipeline, but recovering a broken service can be very difficult when the business requires high-volume and high-throughput operations. It takes experience and the right business environment to manage different volume and throughput levels well.
Pro Tip
Don't let your lack of streaming experience prevent you from applying for jobs. Many companies do not have a good use case for real-time analytics and may not be hiring for this skill.
If you want to get started with data streaming, stream web server logs. It is one of the easiest entry points because you can quickly set up a website with a server and stream its logs into a centralized place.
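As a starting point, here is a rough sketch of streaming web server log lines into Kafka with the kafka-python package. The broker address, topic name, and log file path are assumptions for a local setup, not a production recipe.

```python
import time

from kafka import KafkaProducer  # pip install kafka-python

# Assumes a local Kafka broker and a pre-created topic named "web-logs".
producer = KafkaProducer(bootstrap_servers="localhost:9092")

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical web server log

with open(LOG_PATH) as log_file:
    log_file.seek(0, 2)  # start at the end of the file, like `tail -f`
    while True:
        line = log_file.readline()
        if not line:
            time.sleep(0.5)  # wait for new log lines to arrive
            continue
        # Each log line becomes one message on the topic (asynchronous send).
        producer.send("web-logs", line.strip().encode("utf-8"))
```

A separate consumer process would then read from the topic and write the events to a centralized store; the same idea applies if your stack uses Kinesis or RabbitMQ instead.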
Big Data Tools & Concepts
If a company operates on a very high volume of data daily, on the order of terabytes or petabytes, it is essential to know some big data tools and concepts to stay relevant.
Common tools include (not exhaustive):
Apache Hadoop and MapReduce
Apache Spark
Apache Flink
Cloud providers offer these tools on demand, for example (not exhaustive):
AWS Elastic MapReduce (EMR)
Google Cloud Dataproc
Large companies usually have the resources to store and process data volumes in the region of petabytes, while small and medium businesses typically wouldn't go beyond terabytes because it is expensive.
Thus, big data tools are not essential if the company you're working for or applying for does not have a business case for it.
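If you want a feel for the programming model, here is a minimal PySpark sketch that aggregates a large set of event files. The input path and column names are hypothetical; the idea is that the same job can move from a laptop to an EMR or Dataproc cluster largely unchanged.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_event_counts").getOrCreate()

# Read a (potentially huge) partitioned dataset; Spark distributes the work.
events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path

# Aggregate events per user per day; the plan runs in parallel across the cluster.
daily_counts = (
    events.groupBy("user_id", F.to_date("event_time").alias("event_date"))
          .count()
)

daily_counts.write.mode("overwrite").parquet("s3://my-bucket/daily_counts/")

spark.stop()
```

Only the input sizes and cluster configuration change between a local run and a managed cluster, which is why the concepts matter more than any single tool.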
Data Modeling
Data modeling is the process of organizing and representing data in a way that helps to answer business needs and provide actionable insights.
Data stored but unused is an expense. Once collected, data must be preserved and represented in such a way that:
It reflects business and operational processes accurately on the ground.
The data is trustworthy, with no integrity issues or errors that could impair our interpretation of it.
We can scale and manage the data schemas as we collect more data for the betterment of the business.
Data modeling is both an art and a science. There is no absolute way to do things, but there are best practices that can help us optimize our data schemas and structures, which are contextualized to the business.
Pro Tip
Data modeling has been practiced for years, and there are established methodologies, such as the Kimball and Inmon methods. Leveraging these resources as your baseline will help you improve quickly at data modeling; a toy example follows at the end of this section.
It takes practice to be good at data modeling, which is often required in tech interviews. To practice, you must familiarize yourself with different business contexts and domains and shape a warehouse or database based on these business contexts.
No perfect answer exists, so don't stress over this topic.
If you're data modeling at work, the goal is to iterate your experiments and try different modeling methods until the model is optimized for your business use case.
If you're trying to ace an interview, the interviewers typically want to know if you have sound data modeling principles. It is not about whether you can give a perfect answer.
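To make the Kimball-style approach mentioned above concrete, here is a toy star schema in Python using SQLite: one fact table keyed to a few dimension tables. The table and column names are invented for illustration and would differ for your business context.

```python
import sqlite3

schema = """
-- Dimension tables describe the "who/what/when" of the business.
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,   -- e.g. 20250101
    full_date     TEXT,
    day_of_week   TEXT
);

CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);

CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category      TEXT
);

-- The fact table records measurable events and references the dimensions.
CREATE TABLE fact_sales (
    sale_id       INTEGER PRIMARY KEY,
    date_key      INTEGER REFERENCES dim_date (date_key),
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    product_key   INTEGER REFERENCES dim_product (product_key),
    quantity      INTEGER,
    amount        REAL
);
"""

with sqlite3.connect("toy_warehouse.db") as conn:
    conn.executescript(schema)
```

In an interview setting, being able to sketch and justify a schema like this for a given business context matters far more than the exact column list.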
Managing relational and non-relational databases
As data is stored within databases, it is imperative to know:
How to optimize and manage the load on databases, especially if your business case involves high volumes of transactions.
How to schedule periodic backups so that if your database fails and has to be rebuilt, you can restore it with little to no data loss.
The steps needed to recover a failed database service.
While you may not be the one creating new database clusters because your company has a DevOps team for that, it is essential to know what it takes to recover the service so that you can facilitate the recovery process.
DevOps can recover a service from failure but may not have expertise in managing your data. Working with your DevOps person is essential to recover the data and the service.
Not all DevOps are well-versed in managing database clusters, and as a data engineer, you'll need to provide advice and expertise.
There are two basic types of databases:
Relational databases
Non-relational databases
Other storage types exist, but tech interviews typically focus on what is commonly used in the industry. Unless you're interviewing for a role that uses a specific type of database, such as graph databases, there's no need to learn and memorize the entire catalog.
I'll not cover the details of the different types of databases because there is a lot to digest, but to ace your interviews, you'll need to know:
What are the best use cases of relational databases vs non-relational databases?
The strengths and weaknesses of each type.
Tech interviews typically won't ask questions you can answer from memory or by searching online. They will give you a business context and ask you to design a solution to store the data based on that context.
Your approach and attention to detail reveal everything we need to know to understand your maturity level with these designs.
Pro Tip
Learning how to set up and manage databases isn't tricky. You can always do it locally on your machine to try things out.
However, if you're doing it on a local machine, you won't be able to account for scale and volume in a professional setting.
You don't need to memorize every use case regarding relational and non-relational databases. The point is to learn a few typical use cases and understand why they are used in their context. From these examples, you can think on your feet when you're asked to design a solution.
If you know a few key concepts about relational versus non-relational databases, applying the most suitable design to a problem based on those principles will not be difficult; the toy illustration below shows the contrast.
Many other types of databases are derived from relational and non-relational databases.
Once you know the basics, the knowledge and skills transfer to other kinds of databases.
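Here is that toy illustration in plain Python: the same order represented as normalized relational rows versus a single nested document. This is only a thought aid, not a claim about any specific database product.

```python
import json

# Relational style: normalized rows across tables, joined by keys.
# A strong fit when you need consistency, ad hoc joins, and transactions.
customers = [{"customer_id": 1, "name": "Acme Corp"}]
orders = [{"order_id": 100, "customer_id": 1}]
order_items = [
    {"order_id": 100, "sku": "A-1", "qty": 2},
    {"order_id": 100, "sku": "B-7", "qty": 1},
]

# Document style: the whole order stored as one self-contained document.
# A strong fit when you read and write the aggregate together and the schema varies.
order_document = {
    "order_id": 100,
    "customer": {"customer_id": 1, "name": "Acme Corp"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

print(json.dumps(order_document, indent=2))
```

Neither shape is universally better; the right choice depends on access patterns, consistency needs, and how the data will grow.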
CI/CD Frameworks
When working in a company, you must deploy changes and new features or services through a CI/CD process.
We must version-control these changes to mitigate problems from bad actors and to resolve conflicts when multiple engineers work on the same code repository.
The goal isn't for you to be a subject-matter expert in CI/CD frameworks. Rather, it is to inspire confidence that your deployments to production work as intended through testing and version control as a team.
For tech interviews, you won't be asked in detail about CI/CD because CI/CD processes are tailored to the company that you work with.
However, you may be asked during an interview to describe the high-level steps of a CI/CD deployment.
Pro Tip
Companies can use different tools to manage deployments, such as CircleCI or Jenkins. The goal isn't to know every single tool out there. Rather, understand the baseline concepts so you can apply these principles in any company you work in.
It is possible to simulate CI/CD processes on your local machine or in a sandbox environment in the cloud. If you're new to the topic, this will help you better understand the concepts.
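As one way to simulate this locally, here is a sketch of the kind of unit test a CI pipeline could run on every commit. The transformation function is a hypothetical placeholder; in a real repository it would be imported from your pipeline code.

```python
# test_transformations.py -- run locally with `pytest`, or wire the same
# command into your CI tool (CircleCI, Jenkins, etc.) so it runs on every commit.


def normalize_country_code(raw: str) -> str:
    # Hypothetical transformation under test; defined inline only to keep
    # this sketch self-contained.
    return raw.strip().upper()[:2]


def test_normalize_country_code_trims_and_uppercases():
    assert normalize_country_code("  us ") == "US"


def test_normalize_country_code_truncates_long_values():
    assert normalize_country_code("usa") == "US"
```

Most CI tools boil down to running commands like this automatically on every push, so the concepts transfer regardless of the platform.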
How can data integrity, quality, and timeliness be ensured from source to destination?
Data integrity, quality, and timeliness are vital to any business, as they allow the company to make informed decisions in its own context. Thus, a testing framework is required to ensure these benchmarks are met.
These testing frameworks can be:
Built in-house by a company
Tools in the market, such as DBT
You need to know:
How can the principles of data integrity, quality, and timeliness be applied to different architectural designs and business use cases?
What can you do as a data engineer to uphold these principles from a day-to-day perspective?
In a tech interview, it is uncommon for questions to focus primarily on testing. However, it is relatively common for interviewers to ask how you would apply engineering best practices to ensure data integrity, quality, and timeliness.
Pro Tip
Anyone can build a testing framework from scratch if they know the tenets of data integrity, quality, and timeliness and have sufficient coding knowledge.
Never over-test your work to the point where you cannot act on the issues you find. That is a waste of time, because testing does not inherently provide value to a business; rather, it reduces the risk of your work failing.
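For a sense of what a home-grown check might look like, here is a minimal sketch of data quality assertions in Python with pandas. The expectations (non-null keys, no duplicates, data no older than a day) are hypothetical examples of the kind of rules a business might set, not universal thresholds.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd


def check_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable data quality failures."""
    failures = []

    # Integrity: the primary key must be present and unique.
    if df["order_id"].isna().any():
        failures.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")

    # Quality: amounts should never be negative.
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")

    # Timeliness: the newest record should be less than a day old.
    latest = pd.to_datetime(df["updated_at"], utc=True).max()
    if latest < datetime.now(timezone.utc) - timedelta(days=1):
        failures.append("data is more than one day stale")

    return failures
```

A check like this could run as the final step of a DAG, failing the pipeline and alerting the team whenever the list is non-empty.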
What separates the great from the good?
These are some of the tell-tale signs that show you are excellent at your work:
The amount of data you manage daily.
The larger the data, the more resources you need to process it from source to destination. You'll need to plan for scalability and maintainability to accommodate the added overhead, which speaks to experience and skill.
The ability to optimize the architectural design and data processing steps for the business context and for efficiency.
Both scissors and chainsaws can cut, but you wouldn't use a chainsaw to cut paper or scissors to fell a tree. Experienced engineers know how to design data pipelines according to their context.
Knowledge of the strengths and limitations of each design and framework.
Summary
Data engineering is an ever-evolving subject because more tools and concepts are constantly being developed for new use cases.
However, the fundamentals rarely change, and innovations are often built on top of established practices.
I didn't cover many other topics, such as database change management systems and system health monitoring. However, you'll learn these along the way, and you're unlikely to come across them in an interview.
This article highlights what I have learned over my career of more than 14 years. Feel free to provide feedback or comment if I missed anything.