Data Engineering vs Software Engineering
In my mind Data Engineering and Software Engineering are similar roles but with some key differences. It’s taken me a while to articulate these differences so I am taking this time to get them written down here, and hopefully it will help clear things up for you as well.
Data Engineering is a relatively new field, so it is difficult to define exactly what it is and who does it. Instead, I believe it is more helpful to compare examples of projects in each field:
Software Engineering:
Real-Time Event Processing System
Web Server
Front-End User Interface
Aircraft Control System
Data Engineering:
Move data from one place to another
Extract data from a 3rd-Party system
Transform and normalize raw datasets
Collect & maintain data for analysis
To start off, let’s look at the things they have in common. Both fields write code to manipulate and store data. A software engineer might take input data from users, network requests or hardware sensors, while a data engineer might pull raw datasets from a third-party system, but both types of projects ingest and interpret data to fulfill some sort of functionality.
One of the key differences, though, is how these disciplines orient their data. Software engineers do operations on a row-by-row basis: add another row to this table, update these rows over here, remove that row over there. This is because their application’s state is built up over time as real-time events come in, each one transforming the state of the program.
If problems arise, you can just “turn it off and turn it back on”: if the program’s state has gone bad, clear the state so that the program can rebuild a good one.
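To make the row-by-row idea concrete, here is a minimal sketch of application state that is built up one event at a time and can simply be cleared and rebuilt. The `CartState` class, the event shape and the sample events are all hypothetical, made up for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CartState:
    """In-memory application state, built up one event at a time."""
    items: dict = field(default_factory=dict)  # sku -> quantity

    def apply(self, event: dict) -> None:
        """Apply a single incoming event: insert, update or remove one 'row'."""
        sku = event["sku"]
        if event["type"] == "add":
            self.items[sku] = self.items.get(sku, 0) + event["qty"]
        elif event["type"] == "remove":
            self.items.pop(sku, None)

    def reset(self) -> None:
        """'Turn it off and on again': throw the state away and start clean."""
        self.items.clear()


# Hypothetical stream of events arriving over time.
events = [
    {"type": "add", "sku": "apple", "qty": 2},
    {"type": "add", "sku": "pear", "qty": 1},
    {"type": "add", "sku": "apple", "qty": 3},
    {"type": "remove", "sku": "pear"},
]

state = CartState()
for event in events:
    state.apply(event)

print(state.items)
```

After a `reset()`, nothing needs to be restored from a backup: new events simply start building a fresh state.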
Data Engineers, on the other hand, look at data on a column-by-column basis: create a new column over here, drop the columns over there, join these tables using this column. This is because they are trying to reshape and organize some existing state.
If problems arise, hopefully you saved a backup or are able to pull again from an upstream source, because you can’t rebuild a good state on your own.
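The column-by-column view can be sketched in plain Python by storing each table as a dictionary of columns. The table names, column names and values below are invented for illustration; in practice this kind of work is more often done in SQL or a dataframe library.

```python
# Two small tables stored column by column (hypothetical data).
orders = {
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount_cents": [1250, 400, 990],
    "debug_note": ["", "retry", ""],
}
customers = {
    "customer_id": [10, 11],
    "country": ["DE", "US"],
}

# Create a new column from an existing one, for every row at once.
orders["amount_usd"] = [cents / 100 for cents in orders["amount_cents"]]

# Drop a column that is not needed downstream.
del orders["debug_note"]

# Join the two tables on customer_id by looking up each key.
country_by_customer = dict(zip(customers["customer_id"], customers["country"]))
orders["country"] = [country_by_customer[c] for c in orders["customer_id"]]

print(orders["amount_usd"])
print(orders["country"])
```

Note that every step reshapes state that already exists. If `orders` were corrupted, nothing in this script could regenerate it; it would have to be pulled again from wherever it came from.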
Now of course this isn’t a cut-and-dried distinction. You can easily find examples of Data Engineers doing row-by-row operations and Software Engineers doing column operations, but in my mind this distinction helps explain a lot of the tooling discrepancies we see between the two fields.
SQL is a Data Engineer’s tool of choice because it views and transforms data in a columnar fashion. It is common to see data engineers rebuild an entire table during every refresh rather than use incremental updates that touch only the changed records, because a full rebuild makes it easier to ensure consistency across an entire column. They only resort to incremental updates when the dataset gets too big to process all at once.
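As a rough sketch of the difference between a full rebuild and an incremental update, the example below uses Python’s built-in `sqlite3` module with an in-memory database. The `raw_orders` and `customer_totals` tables, their columns and the helper function names are hypothetical, chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO raw_orders VALUES (1, 'ann', 10.0), (2, 'bob', 5.0), (3, 'ann', 7.5);
""")


def full_rebuild(conn):
    """Drop and recreate the summary table from scratch on every refresh."""
    conn.executescript("""
        DROP TABLE IF EXISTS customer_totals;
        CREATE TABLE customer_totals AS
        SELECT customer, SUM(amount) AS total
        FROM raw_orders
        GROUP BY customer;
    """)


def incremental_update(conn, customer):
    """Recompute only the summary row for one customer whose orders changed."""
    conn.execute("DELETE FROM customer_totals WHERE customer = ?", (customer,))
    conn.execute(
        "INSERT INTO customer_totals "
        "SELECT customer, SUM(amount) FROM raw_orders "
        "WHERE customer = ? GROUP BY customer",
        (customer,),
    )


def totals(conn):
    """Return the summary table as a {customer: total} dictionary."""
    return dict(conn.execute("SELECT customer, total FROM customer_totals").fetchall())


full_rebuild(conn)
print(totals(conn))

# A new order arrives; only bob's row needs to change.
conn.execute("INSERT INTO raw_orders VALUES (4, 'bob', 2.5)")
incremental_update(conn, "bob")
print(totals(conn))
```

The full rebuild is a single query that is hard to get wrong, which is why it is attractive while the data is small. The incremental version does less work per refresh, but it is only correct if you know exactly which customers changed.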
A software engineer would never be okay with rebuilding an entire table every time a value changes, or with waiting a couple of hours for the next scheduled refresh to run, because software engineers need their programs to run quickly and to keep the state of the program always up to date. This is why we mostly see batch processing in data engineering, where it’s usually okay if the results are delayed by a couple of hours or days.
This brings me to the last difference I would like to point out: the people these projects are built for. Software engineering projects are usually customer facing, which puts them on the critical front line with a lot of pressure to ensure that the program works as expected. Any mess-up will directly affect customers and might hurt the company’s reputation and profits, especially at a SaaS company.
It’s rarer to see a company built around a data product than around a software product, so data engineering projects are usually internal facing. That means a mess-up usually won’t directly affect the company’s reputation or profits, which can make the job less stressful, but it also means you might not get paid as much as your software engineering counterpart.