The process of Extract, Transform, and Load (ETL) is the process of taking data from one or more sources, and sending it to a different destination. During the process the data will be converted to fit the new destination locations.
While ETL doesn’t get the attention of many other algorithms, it is a very important process that many businesses need to be able to perform as data comes in from many sources and has to be processed. This can be a scheduled process with data from outside sources on a daily basis or even more often, or a one time data conversion process when an organization moves to a different system.
The data extraction process may pull data from a single source, or multiple sources. The sources may be in the same format, or they may even be in different formats. When the data formats of different sources are heterogeneous, you must find ways to link the data together.
However the data often doesn’t come in a format that can be used natively, therefore it must be transformed. The data type and order often needs to be changed. This can be as simple as rearranging data, to combining or splitting data, or even converting a data format. Often data comes in via a string, or text data, that needs to be converted.
Since data might come from different sources, from different vendors. The data might come from a SQL database, JSON, XML, CSV, or other format. Your application will need to know how to load and process the appropriate files. Luckily there are tools to help you with the loading and processing.
Unfortunately, most ETL processes use a brute force methodology so traditional schools tend to ignore the process. However, in the transform step – there are often some unique challenges, that while simple for a single instance of data, makes you have to wonder when the data is run against tens or even hundreds of thousands of rows of data – let alone larger data sets. So learning how to optimize the process becomes very important.
The other key element in converting data, is knowing that your data is not always going to be correct. Whether data is missing, in an incorrect format, etc, there will always be a need for strong data error checking, and a system that finds a way to work around those hazards.
However, the downside of this is that because you are often working to get data into a system, each application is going to require a unique set of processing.
Once the data is converted you will need to load it into your system, or systems as data may need to be split into different sections to load the data into the systems.
Extract, Transform, and Load was originally found on Access 2 Learn