A year ago, I started working as a data engineer when my team was tasked with automating some datasets and reports for the businesses and sites we support. The challenge was that these sites didn’t have any data pipelines set up beyond simple event capture. What started as a three-month project got me hooked on data and analytics. In this post, I summarize my experiences and key learnings from the past year.
What is Data Engineering?
Data engineering is a specialization within software engineering where engineers build systems that move data from point A to point B and transform it so it can be easily consumed by end users.
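To make that definition concrete, here is a minimal sketch of the extract–transform–load pattern using only the Python standard library. The CSV input and SQLite target are hypothetical stand-ins for a real source system and warehouse, not anything from our actual stack:

```python
# Minimal ETL sketch: move data from a source (A) to a target (B),
# transforming it along the way. Stdlib only; all names are illustrative.
import csv
import io
import sqlite3

# Extract: read raw events from a CSV source (here, an in-memory string).
raw = io.StringIO("user_id,amount\n1,10.5\n2,3.0\n1,4.5\n")
rows = list(csv.DictReader(raw))

# Transform: aggregate raw events into a shape end users can query directly.
totals = {}
for row in rows:
    totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + float(row["amount"])

# Load: write the transformed result into the target store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend_by_user (user_id TEXT, total REAL)")
conn.executemany("INSERT INTO spend_by_user VALUES (?, ?)", totals.items())
result = dict(conn.execute("SELECT user_id, total FROM spend_by_user"))
```

Real pipelines swap in Spark for the transform and a warehouse like Redshift for the load, but the shape of the work is the same.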
What I had to do?
For context, my company owns and operates over 40 brands and businesses. Each team works independently, but we share a central tech and data team with some core components and technologies that serve as a platform for business and allow us to switch teams faster.
Learn the Stack
Our data stack is built on Spark/Scala, Airflow, and Redshift. To get up to speed, I took a few Udemy courses covering the basics of each component. Completing end-to-end projects to get reps in really helped me wrap my head around the stack. I also got plenty of references and support from other teams along the way.
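The core idea Airflow brings to a stack like this is the DAG: tasks declare what must run before them, and the scheduler executes them in dependency order. Here is a toy illustration of that idea using only the standard library; this is not the Airflow API, and the task names are made up:

```python
# Toy sketch of the DAG concept behind Airflow: each task maps to the set
# of upstream tasks that must finish first, and a topological sort gives
# a valid execution order. Task names are hypothetical.
from graphlib import TopologicalSorter

dag = {
    "extract_events": set(),
    "transform_events": {"extract_events"},
    "load_redshift": {"transform_events"},
    "refresh_report": {"load_redshift"},
}

run_order = list(TopologicalSorter(dag).static_order())
# A linear chain like this yields exactly one valid order.
```

In real Airflow the same structure is declared with operators and `>>` dependencies, and the scheduler adds retries, backfills, and monitoring on top.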
Set Up the Stack
In my case, we were forming a new data team for a set of businesses managed from our office. The first step was to spin up our data infrastructure and connect it to the central infrastructure where the data lake and Databricks live. For this step, I had to polish my AWS skills and learn Terraform, since that is how we deploy our infra. A Cloud Guru and Udemy were great resources here too, but nothing beats doing the actual thing.
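For readers who haven’t used Terraform, the appeal is declaring infrastructure as code and letting the tool reconcile it against the cloud. A minimal sketch might look like the following; the resource name and bucket name are hypothetical, not our actual configuration:

```terraform
# Hypothetical example: declare an S3 bucket for the team's staging data.
# `terraform plan` shows the change, `terraform apply` creates it.
resource "aws_s3_bucket" "staging" {
  bucket = "my-team-staging-data"

  tags = {
    Team = "data-engineering"
  }
}
```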
Learn, Try, Rinse, and Repeat – Until We Were Ready
Our team’s first few months were mainly learning by trial and error, backed by a great support system across the company. We also kept a direct line of communication with the business to ensure we were on the right path toward what they needed from us.
What Would I Do Differently?
My experience as a data engineer has been at an enterprise-level company with a lot of processes and guidelines already in place. But if I were asked to define a new data stack from scratch, I would probably go with a fully managed modern data stack to reduce the operational overhead of setting up and maintaining data infrastructure. As with everything in engineering, whether a migration is worthwhile depends on the actual business needs and the current stack. Still, data tooling has advanced a great deal in recent years, and many of the earlier engineering challenges have been commoditized.