Data Engineering Best Practices
The Data Engineer role is responsible for the three D's of the development life cycle: Design, Develop, and Deploy.
The core task of the Data Engineer is to build batch and/or real-time data pipelines that deliver quality, accurate,
and reliable data to the end user (application, report, human).
Here are a few skills and best practices that a Data Engineer should seek to know and master.
Non-Functional Skills
- Understanding the right tool for the job and using it
- Don't use a schema-on-read service to provide data to a REST API. Use a relational or NoSQL data store, an in-memory cache, or another key-value store: any service that offers an index and can return data in milliseconds (see the sketch below).
- Further down is a chart mapping use cases to services.
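As a hedged illustration of the point above, here is a minimal sketch of a REST endpoint backed by an indexed key-value style lookup (the endpoint, database file, table, and columns are hypothetical, not from any real project). A primary-key read returns a single record in milliseconds, which a schema-on-read scan over raw files cannot do on every request.

```python
import sqlite3
from flask import Flask, jsonify, abort

app = Flask(__name__)
DB_PATH = "customers.db"  # hypothetical data store with an indexed primary key

@app.route("/customers/<customer_id>")
def get_customer(customer_id):
    # A primary-key lookup hits the index, so the response returns in milliseconds.
    conn = sqlite3.connect(DB_PATH)
    try:
        row = conn.execute(
            "SELECT customer_id, name, state FROM customers WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        abort(404)
    return jsonify({"customer_id": row[0], "name": row[1], "state": row[2]})

if __name__ == "__main__":
    app.run()
```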
- Recognizing patterns, establishing a design, and implementing code to match the pattern
- There are only a few times that something new needs to be developed. Find and use established working
patterns.
- When presented with a use case, follow this pattern
- Is this process currently being used anywhere else?
- If so, review the solution, talk to the developers, and find out whether the solution is working well. Ask the development team what they would change if they had to start over.
- If not, why not? Is there a better way to accomplish this work that I should know about?
- Get clear acceptance criteria from the end user
- A task that says "review the software documentation" without any acceptance criteria is not the right way to do it.
- Perhaps the task should read, "Create a document explaining the purpose and usage of the software." The acceptance criteria: "A document."
- When estimating the time to complete a task, do not just think about how long it takes to "code
it."
- You must also create test data, test cases, documentation, and pull requests, and take part in review discussions.
- And perhaps the task is not "done" until the PR is approved, which means adding time to answer reviewers' questions and make corrections.
- A pull request review is not the time to ask design questions.
- If the work has made it to the PR stage and a redesign is needed, create a new ticket for redesigning the widget rather than reworking it inside the PR.
- If design questions keep surfacing in PR reviews, get to the bottom of why the design was not settled earlier, or endless rework will sink your project.
- A PR should not touch many files or many lines of code. A PR should be concise; nobody wants or has time to review a PR that changes half the code base.
- Local testing is a must. If you join a project that does not have a way to test the widget locally, make setting up a local testing environment your first task. It will save you and others time in the long run.
- Don't be afraid to admit that you don't know the answer to a question.
- Knowing all the answers to all technical questions in the known universe is impossible. Feel free to say, "I don't know, but I'll find out," or "Hey, I don't know what I'm doing yet; please help me."
- If honesty is looked down on by your company or project team, then it's time to dust off the resume.
- Continuously update status and comments on the issue/task
- A professional data engineer keeps the status and comments on all tasks up to date on a regular schedule. A Jira ticket that has been in progress for two weeks without notes is not something a professional data engineer leaves behind.
- If you host a meeting, always send meeting notes afterward.
- It is professional and provides a knowledge base to refer back to.
- It eliminates doubt or disputes about critical decisions.
Functional Skills
- Have a coding standard and follow it. If it does not exist, then open a Jira to create it.
- Programs are collections of functions plus a driver. It is that straightforward.
- Code should not exist in the main body of the program.
- If you are using functions, then writing test cases should be easy. If you find it challenging to test a function, it must be refactored: difficulty writing test cases indicates that your function does too many things and needs to be simplified.
- Functions must accept input and produce output.
- The input should come from the function arguments, not mysterious places like side effects from previous
calls.
- Don't call other systems from inside your functions. You cannot know how many times your function will be called; if you are making a connection or executing a slow or long-running process, push it up to the caller.
- Your entire program should not be one function. A giant "uber function" is hard to test. Break it down into small functions that each do one thing (see the sketch below).
- If a function is more than a few lines long, it probably needs to be split.
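Here is a minimal sketch of that structure (the table, column names, and database file are hypothetical, not from any real project): small, single-purpose functions that take their input from arguments, with the slow work of connecting pushed up into the driver.

```python
import sqlite3

def fetch_readings(conn, table):
    """Read sensor rows using a connection supplied by the caller."""
    cursor = conn.execute(f"SELECT sensor_id, reading FROM {table}")
    return cursor.fetchall()

def filter_valid(rows, min_value=1, max_value=120):
    """Keep only rows whose reading falls inside the expected range."""
    return [row for row in rows if min_value <= row[1] <= max_value]

def summarize(rows):
    """Return the row count and the average reading."""
    if not rows:
        return {"count": 0, "avg": None}
    return {"count": len(rows), "avg": sum(row[1] for row in rows) / len(rows)}

def main():
    # The driver owns the connection and wires the functions together.
    conn = sqlite3.connect("sensors.db")
    try:
        rows = fetch_readings(conn, "sensor_readings")
        print(summarize(filter_valid(rows)))
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```

Because filter_valid and summarize take plain arguments, they can be unit tested without a database or any connection setup.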
- Data analytics
- Get stats on the data to help you understand the datasets (see the profiling sketch after this list).
- Ask what the min and max sensor readings should be.
- How many rows and columns are there?
- Get a list and count of the domain of values (DOV) for text fields.
- DOV Is the unique list of values in a field.
- If you get a DOV of a state name column, you would expect to see 50 entries, one for each state. If you
see 86 entries, then you have a data quality problem.
- It is also helpful to get a count for each value in the DOV. If your customer base is the entire US, then you would expect a reasonable distribution between values based on state population. If the state of CT has 10x the count of CA, then it should be investigated.
- What are all the distinct values in the merchant name column?
- Get the min, max, and average of numeric fields.
- The min and max of an age column should fall between 1 and 120. If you see other values, then check the data quality.
- Make notes of all your findings and talk to data owners.
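A minimal profiling sketch in pandas (the file and the column names state, merchant_name, and age are hypothetical) covering the checks above: row and column counts, the DOV of text fields with per-value counts, and min/max/average of numeric fields.

```python
import pandas as pd

# Hypothetical dataset; the column names are illustrative only.
df = pd.read_csv("customers.csv")

# How many rows and columns are there?
print(df.shape)

# Domain of values (DOV): the unique list of values in a text field, with a count per value.
print(df["state"].nunique())         # expect 50 for US states; 86 means a data quality problem
print(df["state"].value_counts())    # distribution should roughly follow state population
print(df["merchant_name"].unique())  # all distinct merchant names

# Min, max, and average of numeric fields; an age outside 1-120 is worth a data quality check.
print(df["age"].agg(["min", "max", "mean"]))
```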
- JOINS
- Any time data is joined, you need to analyze the join columns.
- Get the DOV for all join columns and compare them.
- Always perform a full outer join first, letting you see the matched and unmatched values. You might spot a
join failing due to a space or misspelling.
- You might find that you are joining null values on each side, which could have been eliminated with a 'not
null' clause in the query.
- Inspect the full outer join results and get a count of the matched, left-only, and right-only (unmatched) rows (see the sketch after this list).
- Refrain from making assumptions about the data. Talk to the data owner or expert to verify assumptions. You might omit critical data from the final results if you disregard the unmatched rows and use a left join.
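A minimal sketch of this join analysis in pandas (the frames and the join column state_code are hypothetical): compare the DOV on each side, drop nulls so null values are not joined to each other, run a full outer join with an indicator column, and count matched and unmatched rows.

```python
import pandas as pd

# Hypothetical inputs; "state_code" is an illustrative join column.
customers = pd.read_csv("customers.csv")
regions = pd.read_csv("regions.csv")

# Compare the DOV of the join column on each side; the symmetric difference shows mismatches.
print(set(customers["state_code"].dropna()) ^ set(regions["state_code"].dropna()))

# Drop nulls so we are not joining null values on either side.
left = customers.dropna(subset=["state_code"])
right = regions.dropna(subset=["state_code"])

# Full outer join with an indicator column recording where each row matched.
merged = left.merge(right, on="state_code", how="outer", indicator=True)
print(merged["_merge"].value_counts())  # counts of 'both', 'left_only', and 'right_only'

# Inspect the unmatched rows; a stray space or misspelling often shows up here.
print(merged[merged["_merge"] != "both"].head())
```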
- Output Tuning
- The output (objects/files) being written to the object store or filesystem must align with the storage
service's recommendations.
- When writing to AWS's S3 storage system, using the parquet file format with snappy compression and a 1GB
object size is recommended.
- The best way to do this is to write the dataset to the object store without any modifications. Then, inspect the object sizes and adjust using repartitioning or a maximum output file size (a minimal sketch follows below).
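A minimal PySpark sketch of that tuning loop (the bucket paths, partition count, and records-per-file value are placeholders, not recommendations): write once with snappy-compressed parquet, inspect the resulting object sizes, then repartition or cap records per file and rewrite.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-tuning").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/transactions/")  # hypothetical source path

# First pass: write as-is, then inspect the object sizes in the target bucket.
df.write.mode("overwrite").option("compression", "snappy") \
    .parquet("s3://my-bucket/curated/transactions/")

# Second pass: if the objects are far from the ~1GB target, repartition before writing
# and/or cap the number of records per output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)  # illustrative value
df.repartition(16).write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://my-bucket/curated/transactions/")
```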
- Partitioning
- Partitioning is essential for query performance from an end-user and application perspective.
- End-user partitions are usually different from the partitions used by ETL processes (see the sketch after this list).
- A standard ETL partition is by data arrival date, such as "asofdate=20230102". End-user queries are usually not concerned with the data's arrival date and will typically use a transaction date instead, such as "transaction_date=20230102".
- Multi-level partitions are often used when data is too large to be contained in a single-level partition. In
such cases, the data might be split by asofdate, source system, transaction date, and transaction state.
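A minimal PySpark sketch of these layouts (the DataFrame, paths, and the column names asofdate, source_system, and transaction_date are placeholders): one layout partitioned by arrival date for ETL, one by transaction date for end users, and a multi-level variant for very large data.

```python
# Assumes a DataFrame `df` like the one above; paths and column names are illustrative.

# ETL-facing layout: partition by data arrival date, e.g. asofdate=20230102.
df.write.mode("overwrite").partitionBy("asofdate") \
    .parquet("s3://my-bucket/etl/transactions/")

# End-user-facing layout: partition by transaction date, e.g. transaction_date=20230102.
df.write.mode("overwrite").partitionBy("transaction_date") \
    .parquet("s3://my-bucket/analytics/transactions/")

# Multi-level partitioning when a single level is too large to be practical.
df.write.mode("overwrite").partitionBy("asofdate", "source_system", "transaction_date") \
    .parquet("s3://my-bucket/analytics/transactions_multilevel/")
```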