Dear friends!
Data transformation is a crucial part of any data pipeline. For data to have the biggest impact, it needs to be not only accurate but also consumable. James Kaleo Kistner and I will lay a foundation covering the concept, modern tools, and process surrounding data transformation, with the goal of making your (big) data more accessible and your data strategy more consistent!
What is Data Transformation?
Data transformation is the process of converting data from one format to another. While it can happen at many points in a data pipeline, the two most common patterns are ETL and ELT. Both consist of the same three steps: extraction, transformation, and loading. The only difference is the order of the last two: ETL transforms data before loading it into the destination, while ELT loads the raw data first and transforms it inside the destination. Either way, the goal is to move data from one, usually raw, location into a destination where the data is more refined after transformation. With a basic understanding of what data transformation is, let's walk through the four main steps to transform data.
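To make that ordering concrete, here is a minimal sketch in Python. The extract, transform, and load functions and the sample record are hypothetical placeholders for illustration, not any particular tool's API:

```python
def extract():
    # Pull raw records from a source system, e.g. an API or database.
    return [{"name": " Ada ", "signup": "2023-01-05"}]

def transform(records):
    # Clean and reshape records into the target format.
    return [{"name": r["name"].strip(), "signup": r["signup"]} for r in records]

def load(records, destination):
    # Write records to the destination (a plain list standing in
    # for a warehouse table).
    destination.extend(records)

# ETL: transform in flight, then load the refined data.
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: load the raw data first, then transform it inside the destination.
staging = []
load(extract(), staging)
warehouse_elt = transform(staging)
```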
#1 Sourcing the Data: Start by identifying all the data that needs to be transformed. This data could come from many different places: websites, software, and databases, to name a few. A data profiling tool, such as Talend Open Studio or Apache Griffin, can help you take stock of what each source actually contains.
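As a rough illustration of what profiling involves (dedicated tools automate much more), here is a minimal sketch using pandas; the dataframe and column names are made up for the example:

```python
import pandas as pd

# Hypothetical source data; in practice this would come from your
# websites, software, or databases.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
})

# Basic profile: column types, null counts, and distinct values.
print(df.dtypes)
print(df.isna().sum())
print(df.nunique())
```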
#2 Mapping the Data: Data mapping is the process of connecting data from two different places by defining how fields in one format correspond to fields in another. Data comes in all sorts of formats and types (JSON, text, booleans, integers) and needs to be parsed appropriately to fit the target format. Start by figuring out what format your data arrives in and what format you need it in. Then map the origin fields to the destination fields and identify the code you will need to convert the data types.
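Here is a minimal sketch of such a mapping in Python. The field names, target types, and the FIELD_MAP structure are all hypothetical illustrations, not a standard:

```python
from datetime import datetime

# Hypothetical mapping: each source field is paired with its target
# field name and a converter that coerces it into the target type.
FIELD_MAP = {
    "signup_date": ("signed_up_at", lambda s: datetime.strptime(s, "%m/%d/%Y")),
    "is_active":   ("active",       lambda s: s.lower() == "true"),
    "age":         ("age_years",    int),
}

def map_record(source):
    # Rename each field and convert its value in one pass.
    return {target: convert(source[field])
            for field, (target, convert) in FIELD_MAP.items()}

raw = {"signup_date": "01/05/2023", "is_active": "True", "age": "34"}
print(map_record(raw))
# {'signed_up_at': datetime(2023, 1, 5, 0, 0), 'active': True, 'age_years': 34}
```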
#3 Coding the Transformations: Now that the data has been mapped, it is time to figure out which transformations will achieve the desired output. You will need to write code that performs the transformation, but you don't have to start from scratch: tools such as dbt or Apache Airflow can help. While this is the key step in producing the outcome you want, it is also where the most mistakes occur. To make sure you don't accidentally corrupt your data, run the transformation on a sample dataset first. Once you have made all the necessary changes, you're ready to go!
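As an illustration of that dry-run approach, here is a minimal sketch in pandas; the transform function and its columns are hypothetical:

```python
import pandas as pd

def transform(df):
    # Example transformations: normalize emails and derive a full name.
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    out["full_name"] = out["first_name"] + " " + out["last_name"]
    return out

# Dry-run on a small sample before touching the full dataset.
sample = pd.DataFrame({
    "first_name": ["Ada"],
    "last_name": ["Lovelace"],
    "email": ["  Ada@Example.COM "],
})
print(transform(sample))
```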
#4 Validating the Data: Far too often, people overlook this last step because it isn't very glamorous, but it is imperative to the process. Double-check your work! Run validation tests to make sure everything worked as expected. Some of these tests might include checking for uniqueness, checking columns for expected values, and looking at counts and aggregated values. Cross-validate the data against as many sources as possible to ensure the quality and integrity of the final output. Data transformation can be a long process, and even if your data was error-free at the beginning, you may have introduced inconsistencies along the way. Checking your work now can save you a lot of pain down the road!
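Here is a minimal sketch of such checks in Python using plain assertions; the columns and expected values are hypothetical. In practice, tools like dbt ship built-in tests (unique, not-null, accepted values) that cover these same checks declaratively:

```python
import pandas as pd

def validate(df, expected_rows):
    # Uniqueness: no duplicate primary keys.
    assert df["user_id"].is_unique, "duplicate user_id values"
    # Expected values: status limited to a known set.
    assert df["status"].isin({"active", "inactive"}).all(), "unexpected status"
    # Counts: row count matches what the source reported.
    assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"

df = pd.DataFrame({"user_id": [1, 2], "status": ["active", "inactive"]})
validate(df, expected_rows=2)
print("all checks passed")
```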
Transforming data might not be as flashy as building machine learning models or creating beautiful dashboards, but there is a lot of fulfillment in getting the most out of your data. It is an essential step underlying nearly every analytical task, so knowing how to manipulate data is necessary in every data-driven field.