Data processing pipelines: a Swiss Army knife for data engineering
In the AI era everything is data, and every kind of data can be processed and analyzed to produce a generalized model of its inner relationships. It really doesn’t matter whether it’s text, images, sound, sensor readings or video: there are machine learning and deep learning models able to handle the vast majority of data types.
Machine Learning/Python developer
Nowadays, with data coming from many different sources and with the ongoing commoditization of ML and AI, attention is shifting from data science to data engineering, which has to gather, prepare and integrate data from multiple - sometimes very different - sources and get it ready for analysts and data scientists to work with and act on.
One of the most valuable tools in the data engineer’s portfolio - kind of a data engineering Swiss Army knife - is the data processing pipeline.
What is a data processing pipeline anyway?
A data processing pipeline is a set of instructions (usually algorithms of some sort) that tell the system how to handle data. It is a kind of roadmap - usually created by a data engineer - showing discrete steps from the initial state to the final state, each step being another recipe.
I know this sounds cool, but it doesn’t really help with implementing one, right?
This is where the real beauty of data processing pipelines begins: I’ve worked with data pipelines for some time now, I’ve implemented dozens of them, and so far I have learnt that there is no single solution to suit (or rule) them all. It is all a matter of data type, business context, amount of data and many other variables.
What’s ETL and how does it differ from a data processing pipeline?
You have probably heard about the Extract Transform Load process (ETL) and now you may be wondering how it relates to data processing pipelines. Long story short: ETL is a type of data processing pipeline with three phases.
In the ETL process, data is first extracted from its origin: this may be a website, a relational or non-relational database, a set of log files or, in broader terms, a data lake.
Then comes the transformation. In this broad step, data is merged and reshaped: text may be lemmatized or labeled with topics, images may be resized or desaturated. In this step you may also use machine learning solutions to transform the raw data into a more processed, more specific form. It’s kind of funny that we are using machine learning, deep learning and AI to prepare data for other - usually more complex - models, but it happens and that’s cool!
The last step is loading, which usually means loading the transformed data into a database in a form that has way more structure than the initial data.
ETL may be an early part of a machine learning pipeline or it may be used to prepare data for deep learning applications. It is also a process commonly used to move meaningful data from a data lake to a data warehouse.
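To make the three phases a bit more tangible, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption made for illustration: the `RAW_CSV` sample, the `readings` table and the function names are hypothetical, and an in-memory SQLite database stands in for the real destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: in a real pipeline this would come from a website,
# a database or log files rather than from an in-memory string.
RAW_CSV = """sensor,reading
a, 21.5
b,not-a-number
a,22.5
"""


def extract(source: str) -> list[dict]:
    """Extract: read raw rows from the origin (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list[dict]) -> list[tuple[str, float]]:
    """Transform: clean values and drop rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["sensor"].strip(), float(row["reading"])))
        except ValueError:
            continue  # skip malformed readings
    return cleaned


def load(rows, connection) -> None:
    """Load: write the structured rows into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor TEXT, reading REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
stored = connection.execute(
    "SELECT sensor, reading FROM readings ORDER BY reading"
).fetchall()
print(stored)
```

Keeping each phase in its own function means any of them can be swapped (a different source, a heavier transformation, another database) without touching the other two.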
Making pipelines sounds good, but how?
There are two main topics to cover when talking about pipelines themselves: the technology used to create a single pipeline, and processing in the cloud. These two aspects are tied together because some technologies force us to use the cloud (e.g. most of the proprietary solutions for data transformation are cloud-native).
There is also a third point: managing your pipelines in an effective way. This is the crucial part if you wish to create a data-driven service or provide data for analytics inside your company, but for some reason the topic is mostly overlooked at the early stages, when the cost of applying monitoring standards is the lowest.
I would say, “use Python”, but as I have already stated, there is no one go-to solution, so I think I need to introduce a bit of pluralism here. Let’s start with Python anyway, because it has some real benefits to offer.
- Custom pipelines with Python
It’s FOSS, it’s perfect for dealing with data, and it can be made into a full machine learning pipeline without any unnecessary additions or external connectors. It can handle almost any type of input and output (and where it can’t by default, your engineers can add the missing piece). It is also testable (which is always good!), it can run on a wide range of devices and clouds, and it already has a few good libraries designed for data pipelines.
Building pipelines with Python has so many advantages I could write a separate article about it, but it also has one big disadvantage: creating pipelines with Python is not always intuitive because there is no visual representation of the pipeline, so you need data engineers who are comfortable with Python.
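To give a feel for what a custom Python pipeline can look like, here is a minimal sketch in which each step is a plain function and the pipeline is just those functions chained together. The helper `build_pipeline` and the three text-cleaning steps are hypothetical names invented for this example, not part of any library.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[list], list]


def build_pipeline(steps: Iterable[Step]) -> Step:
    """Chain steps so the output of one becomes the input of the next."""
    steps = list(steps)

    def run(data: list) -> list:
        return reduce(lambda current, step: step(current), steps, data)

    return run


# Hypothetical steps for a small text-cleaning pipeline.
def drop_empty(lines: list) -> list:
    return [line for line in lines if line.strip()]


def normalize(lines: list) -> list:
    return [line.strip().lower() for line in lines]


def deduplicate(lines: list) -> list:
    # dict.fromkeys keeps the first occurrence and preserves order
    return list(dict.fromkeys(lines))


clean_text = build_pipeline([drop_empty, normalize, deduplicate])
result = clean_text(["  Hello ", "", "WORLD", "hello"])
print(result)
```

Because every step is an ordinary function, each one can be unit-tested on its own, which is the testability benefit mentioned above. The price is exactly the disadvantage described: the order of steps lives in code, not in a diagram.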
To be fair, I need to state that Python is not the only language capable of implementing data pipelines. Scala is also a nice choice (especially if your pipeline includes the Hadoop ecosystem). Java works well too! And whenever you need a highly scalable solution, you should consider using Apache Kafka, a distributed event streaming platform built with scalability in mind and often used as the backbone of streaming data pipelines.
- Visual tools, e.g. Pentaho
Pentaho is actually a business intelligence platform, but it has an integrated ETL tool (called Pentaho Data Integration), so it’s capable of creating a functional data processing pipeline.
What makes it cool? Pentaho is FOSS, and it has a proper GUI, meaning you can build your pipeline by connecting icons together; no code writing needed. It’s way more intuitive than writing code, and the visual representation of the pipeline makes it powerful.
On the downside, I must note that visual data processing tools are less flexible than other solutions. Their debugging capabilities and testability are also at a lower level.
Pentaho is not the only visual tool capable of creating data processing pipelines, but it’s the most notable in the FOSS family. Outside of this group, there are also some notable representatives, for example Alteryx and Informatica.
- Pipeline in database
The third of the three major ways to implement a data processing pipeline is directly in the database, most probably using SQL for relational databases. This method has its obvious restrictions, though: it can only work with data that already sits in relational tables (forget about multiple types and sources of data collection). At the same time, correctly implemented in-database calculations prove to be very fast and effective whenever applied to whole tables.
This kind of pipeline cannot be applied to every set of transformation and aggregation processes, but is worth considering in scenarios with big collections of structured relational data.
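As a rough illustration of the in-database approach, the sketch below uses Python’s built-in `sqlite3` module only to send SQL to the engine; the actual filtering and aggregation happen inside the database. The `orders` and `customer_totals` tables and their contents are invented for this example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical source table standing in for data already in the database.
connection.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        ('alice', 10.0, 'paid'),
        ('alice', 15.0, 'paid'),
        ('bob',   20.0, 'cancelled'),
        ('bob',    5.0, 'paid');
    """
)

# The whole transformation (filter + aggregate + load) is one set-based
# statement executed by the database engine, with no row-by-row Python loop.
connection.executescript(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total_paid, COUNT(*) AS paid_orders
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer;
    """
)

totals = connection.execute(
    "SELECT customer, total_paid, paid_orders "
    "FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```

The data never leaves the database, which is where the speed on whole tables comes from; it is also why this style stops being an option as soon as part of the input is not relational.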
To cloud or not to cloud?
Once you settle on technology, you may start wondering how to execute the data processing or machine learning pipeline. It may seem a foolish question, but I assure you it’s worth asking: should I put my pipeline in the cloud?
The benefits are obvious: with the cloud we are getting rid of all the infrastructure issues (or more precisely those issues are delegated to a third party) and we don’t really have to design for the database capacity and overall server performance, which we have to take into account for machine learning and deep learning pipelines. In the case of database capacity issues or underpowered instances, we can always pay more for more space or for a better machine.
But does the cloud present us only with benefits? My working experience has taught me there are situations, institutions and kinds of data where the cloud is not an option, or at least not the preferred one. If you’re analyzing highly sensitive data or you are working for a reputation-sensitive institution (like an investment bank, for example), a dedicated on-site database and calculation server are options worth considering.
In any other situation, the cloud should be your go-to solution for wrangling, processing and analyzing large amounts of data for machine learning or deep learning pipelines.
Creating the pipeline is not the hardest part; managing multiple pipelines is
With the pipeline created and in place you may start to see additional problems: how do I run the process in an automated manner? Is my sequential data processed in the right order? Is it in good shape? At what stage of processing is my data right now? Can the process fail, and what happens if it does?
There are tools on the market able to help with those questions, but it’s important that you answer some of them before you even start to implement a specific pipeline.
- Do I need to quality check the product of the data processing pipeline?
- Do I need to monitor the state of processing?
- What should happen if the process fails? How would I know?
- Should the process be triggered automatically or manually?
These questions may help you understand your needs better in order to assess if you even need a pipeline monitoring system; e.g. when you start small, with a single simple pipeline, a monitoring system may be overkill, stopping you from getting analysis insights or from further implementation of machine learning and deep learning pipelines.
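For a single simple pipeline, the questions above can often be answered with very little code. The following is a minimal sketch, not a monitoring system: the `PipelineRun` class, the `check_quality` gate and the stage names are all hypothetical, and the standard `logging` module stands in for whatever alerting you would really use.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


@dataclass
class PipelineRun:
    """Records which stage a run reached and why it stopped."""

    statuses: dict = field(default_factory=dict)

    def execute(
        self, stages: list[tuple[str, Callable[[Any], Any]]], data: Any
    ) -> Any:
        for name, stage in stages:
            self.statuses[name] = "running"
            try:
                data = stage(data)
            except Exception as error:
                # "What should happen if the process fails? How would I know?"
                self.statuses[name] = f"failed: {error}"
                logger.error("stage %s failed: %s", name, error)
                return None  # later stages are never started
            self.statuses[name] = "done"
            logger.info("stage %s finished", name)
        return data


def check_quality(rows: list) -> list:
    """Hypothetical quality gate: refuse to pass on an empty result."""
    if not rows:
        raise ValueError("no rows produced")
    return rows


run = PipelineRun()
output = run.execute(
    [
        ("parse", lambda rows: [int(r) for r in rows]),
        ("filter", lambda rows: [r for r in rows if r > 100]),
        ("quality_check", check_quality),
        ("publish", lambda rows: rows),
    ],
    ["1", "2", "3"],
)
print(output, run.statuses)
```

Here the filter leaves nothing behind, so the quality check fails, the failure is logged, and the publish stage never runs. Once you need scheduling, retries or several pipelines at once, a sketch like this stops being enough.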
If you decide to give a monitoring system a try you may - again - consider going for FOSS and try a production-ready solution like Apache Airflow. It solves most of the issues with pipelines, providing a scheduler. And it has the ability to monitor stages of multiple pipelines; it even has a Gantt chart, showing scheduled tasks with their durations.
When you start from scratch, it may be tempting to build a monitoring system by yourself. This is a very instructive exercise, which I went through with one of my clients in the past. Long story short: in a few months we built an in-database monster with tons of relationships, flags, statuses, triggers and cursor-based queries. In the end we had to admit defeat and replace our monster with some proprietary software. Anyway, if you have the time and money to learn from your own mistakes, I highly recommend going through this process. Otherwise, it is better and faster to look for a helping hand from someone who has already traveled this path.
Data processing pipelines - a worthwhile effort
Building data processing pipelines for machine learning or deep learning projects is a complex and challenging activity, but it’s also very beneficial. Well-designed, scalable and testable pipelines with a proper monitoring system give you a stable environment for your data science, data analysis and business intelligence projects. They provide a stable data flow from the extraction phase, through all the merging, melting, wrangling and other transformation activities, to loading the data into the destination database.
Such a large number of processes may look overwhelming, but at the end of the day it’s definitely worth a try!