Spark: A Data Engineer’s Best Friend
Data engineering, as a separate category of expertise in the world of data science, did not occur in a vacuum. The role of the data engineer originated and evolved as the number of data sources and data products ballooned over the years. Therefore, the role and function of a data engineer is closely associated with a variety of different data-processing platforms such as Apache Hadoop, Apache Spark, and a huge number of specialized tools. In this article, you will find out why Spark should be considered the data engineer’s best friend.
- 1 Spark is the ultimate toolkit
- 2 What makes Spark unique
- 3 Designed for the data engineer
Spark is the ultimate toolkit
Data engineers often work in multiple, complicated environments and perform the complex, difficult, and, at times, tedious work necessary to make data systems operational. Their job is to get the data into a form where others in the data pipeline, like data scientists, can extract value from the data.
Spark has become the ultimate toolkit for data engineers because it simplifies the work environment by providing both a platform to organize and execute complex data pipelines, and a set of powerful tools for storing, retrieving, and transforming data.
Spark doesn’t do everything, and there are lots of important tools outside of Spark that data engineers love. But what Spark does is perhaps the most important thing: It provides a unified environment that accepts data in many different forms and allows all the tools to work together on the same data, passing a data set from one step to the next. Doing this well means you can create data pipelines at scale.
With Spark, data engineers can:
- Connect to different data sources in different locations, including cloud sources such as Amazon S3, databases, Hadoop file systems, data streams, web services, and flat files.
- Convert different data types into a standard format. The Spark data processing API accepts many different types of input data. Spark then uses Resilient Distributed Datasets (RDDs) and DataFrames for simplified yet advanced data processing.
- Write programs that access, transform, and store the data. Spark exposes APIs in common programming languages (Scala, Java, Python, and R), so Spark code can be integrated directly, and it offers many powerful functions for complex ETL-style data cleaning and transformation. Spark also includes a high-level API that allows users to seamlessly write queries in SQL.
- Integrate with almost every important tool for data wrangling, data profiling, data discovery, and data graphing.
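The first three capabilities in the list above can be sketched in a few lines of PySpark. This is a minimal sketch, not a reference pipeline: the file layouts, the column names (`order_id`, `id`, `region`, `amount`, `total`), and the function names are all hypothetical, and the paths could just as well be `s3a://` or `hdfs://` URIs if the matching connectors are configured.

```python
def unify_orders(spark, csv_path, json_path):
    """Read orders from two differently shaped sources into one DataFrame.

    `spark` is an existing SparkSession. All column names are hypothetical.
    """
    # Source 1: a flat CSV file with a header row; columns arrive as strings.
    csv_orders = (
        spark.read.option("header", True).csv(csv_path)
        .selectExpr(
            "CAST(order_id AS STRING) AS order_id",
            "region",
            "CAST(amount AS DOUBLE) AS amount",
        )
    )
    # Source 2: newline-delimited JSON whose field names differ from the CSV.
    json_orders = (
        spark.read.json(json_path)
        .selectExpr(
            "CAST(id AS STRING) AS order_id",
            "region",
            "CAST(total AS DOUBLE) AS amount",
        )
    )
    # Both sources now share one schema, so they can be combined by column name.
    return csv_orders.unionByName(json_orders)


def revenue_by_region(spark, orders):
    """Query the unified DataFrame with plain SQL."""
    orders.createOrReplaceTempView("orders")
    return spark.sql(
        "SELECT region, SUM(amount) AS revenue "
        "FROM orders GROUP BY region ORDER BY region"
    )


def demo(csv_path, json_path):
    """Run the sketch on a local Spark session (requires PySpark installed)."""
    # Imported here so the helpers above can be defined without Spark present.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("orders-demo").getOrCreate()
    )
    try:
        revenue_by_region(spark, unify_orders(spark, csv_path, json_path)).show()
    finally:
        spark.stop()
```

Calling `demo("orders.csv", "orders.json")` on a machine with PySpark installed would print one revenue row per region. The point of the sketch is that once both sources are DataFrames with the same schema, every later step, whether SQL or code, works on the same data.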
What makes Spark unique
In order to understand why Spark is so special, it is important to compare it to the Hadoop infrastructure, which was crucial earlier in the rise of big data and big data analytics.
It’s modular: Spark is essentially a modular toolkit initially designed to work with Hadoop via the YARN cluster manager interface. This pairing made sense because Hadoop provided both the compute and storage resources: Spark offered many tools for processing data, while Hadoop handled large volumes of affordable persistent storage and the scaling of compute and storage nodes. It quickly became apparent, however, that coupling compute and storage wasn’t cost effective, because one could not grow without the other. Since then, many efficiencies have been introduced to support cloud-scale architectures, all in an attempt to decouple storage and compute. Spark can be used outside of Hadoop, which allows either resource to scale independently. Ultimately, this means that whatever an organization’s preferred storage and compute infrastructure is, Spark lets users interface with it.
Accepts data of any size and form: Spark emerged a few years after Hadoop (Hadoop dates to 2006; Spark began as a research project in 2009) and was more focused on how data of any size could be combined to support the development of applications and analytical workloads. While Hadoop offered a variety of low-level capabilities, Spark provides a much broader and tailor-made environment that takes raw materials, turns them into reusable forms, and delivers them to analytic workloads. Spark can work in a batch fashion, but it can also work interactively. As a result, Spark has become the go-to platform for most data applications and is especially well tailored to solving the problems of data engineering. Essentially, Spark outgrew Hadoop.
Supports multiple approaches and users: The Hadoop infrastructure ushered in the era of big data by creating a platform that could creatively and affordably store and process data at quantities never previously imagined, and then make that data usable through MapReduce. But Spark was created to support the processes further up the stack as well. While Spark can access raw forms of data and interact with Hadoop file systems, it isn’t tied to a single paradigm for achieving these aims. Instead, it was built from the ground up to provide multiple approaches to processing while using the same underlying data format.
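To make "multiple approaches, same underlying data" concrete, here is a minimal PySpark sketch that computes the same per-region totals three ways: with the DataFrame API, with SQL, and with the lower-level RDD API. The function name, the `sales` view, and the `region` and `amount` columns are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def totals_three_ways(spark, rows):
    """Compute per-region totals with three different Spark APIs.

    `rows` is a list of (region, amount) tuples; the names are hypothetical.
    """
    sales = spark.createDataFrame(rows, ["region", "amount"])

    # 1) DataFrame API: declarative, column-oriented operations.
    df_rows = sales.groupBy("region").sum("amount").collect()
    via_dataframe = {r["region"]: r["sum(amount)"] for r in df_rows}

    # 2) SQL: the same aggregation expressed as a query over a temporary view.
    sales.createOrReplaceTempView("sales")
    sql_rows = spark.sql(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    ).collect()
    via_sql = {r["region"]: r["total"] for r in sql_rows}

    # 3) RDD API: explicit map and reduce steps over the same rows.
    via_rdd = dict(
        sales.rdd
        .map(lambda r: (r["region"], r["amount"]))
        .reduceByKey(lambda a, b: a + b)
        .collect()
    )
    return via_dataframe, via_sql, via_rdd
```

Given a small input such as `[("EU", 10), ("US", 4), ("EU", 2)]`, all three dictionaries should agree. Which style a team uses is largely a matter of who is writing the step: analysts may prefer SQL, while engineers may prefer the DataFrame API.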
Designed for the data engineer
Spark offers data engineers a great deal of elasticity and flexibility when approaching their work. For example, a TensorFlow job in Spark can access data stored in HDFS and in multiple file formats. Because Spark takes an API-driven approach, engineers have a wide variety of tools to use with Spark as the analytics engine. This modularity enables the use of open-source tools and avoids vendor lock-in with bespoke, application-specific programming tools.
Since Spark now works with Kubernetes and containerization, engineers can spin up and spin down Spark clusters and manage them efficiently as Kubernetes pods rather than relying on physical, standalone, or bare-metal clusters. Deploying a Spark cluster on top of a Kubernetes cluster leverages the hardware abstraction layer managed by Kubernetes. This further frees up a data engineer to do data engineering and avoid the complex and often time-consuming work of IT administration and cluster management.
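As a rough illustration of what running Spark on Kubernetes looks like in practice, the sketch below assembles a `spark-submit` command that targets a Kubernetes API server. The flags and `spark.*` configuration keys are standard Spark options, but the server URL, image name, application path, job name, and the helper function itself are hypothetical placeholders.

```python
import shlex


def spark_submit_for_kubernetes(api_server, image, app_path,
                                executors=2, namespace="default"):
    """Build a spark-submit command line that runs a job on Kubernetes.

    All argument values passed in below are hypothetical examples.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to schedule pods through this API server.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself also runs in a pod.
        "--deploy-mode", "cluster",
        "--name", "nightly-etl",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # Number of executor pods to request for this job.
        "--conf", f"spark.executor.instances={executors}",
        # local:// means the file already exists inside the container image.
        app_path,
    ]


command = spark_submit_for_kubernetes(
    api_server="https://k8s.example.com:6443",
    image="registry.example.com/spark-py:latest",
    app_path="local:///opt/spark/app/etl.py",
)
print(shlex.join(command))
```

Because the executors are requested per job, changing `executors` is all it takes to size the cluster up or down for the next run; there is no long-lived cluster for the engineer to administer.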
Spark was created not only to solve the problems of data engineering, but also to be accessible and helpful to the people who are further down the data pipeline. Thus, while Spark was designed for data engineers, it is actually increasing the number of people who can get value out of data. By offering scalable compute with scalable toolsets, Spark empowers engineers to empower others to leverage data to the fullest. Perhaps, then, Spark is not only a data engineer’s best friend but everybody’s best friend?
To see how Spark empowers a variety of users for streaming workloads, machine learning, analytics, and data engineering, check out my blog: Ready to become a superhero? Build an ML model with Spark on HPE Ezmeral now.
About Don Wake
Don has spent the past 20 years building, testing, marketing, and selling enterprise storage, networking, and compute solutions in the rapidly evolving information technology industry. Today, his focus is on HPE Ezmeral: the ultimate toolkit to manage, deploy, execute, and monitor data-centric applications on software- and hardware-based architectures in the cloud, on premises, and at the edge.
Copyright © 2021 IDG Communications, Inc.