What is Spark RDD?
Spark RDD is short for Apache Spark Resilient Distributed Dataset, usually abbreviated to simply RDD. RDDs are a foundational component of the Apache Spark large-scale data processing framework.
A Spark RDD is an immutable, fault-tolerant, and possibly distributed collection of data elements. RDDs may be operated on in parallel across a cluster of computer nodes. To operate in parallel, RDDs are divided into logical partitions. Partitions are computed on different nodes of the cluster through Spark Transformation APIs. RDDs may contain any type of Python, Java, or Scala objects, including user-defined classes.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value by performing a computation on the RDD.
How are Spark RDDs created?
Spark RDDs are created in two ways: by loading data through a SparkContext function, or by applying a Spark Transformation function to an existing RDD. SparkContext functions create new RDDs from a variety of sources; e.g. the textFile function reads text from a local filesystem, Amazon S3 or Hadoop’s HDFS. Transformation functions are then used to create new RDDs from previously created RDDs. For example, an RDD of only the customers from North America could be constructed from an RDD of all customers throughout the world.
In addition to loading text files from file systems, RDDs may be created from external storage systems such as JDBC databases (e.g. MySQL), HBase, Hive, Cassandra, or any data source compatible with the Hadoop InputFormat.
RDDs are also created and manipulated when using Spark modules such as Spark Streaming and Spark MLlib.
Why Spark RDD?
Spark makes use of data abstraction through RDDs to achieve faster and more efficient performance than Hadoop’s MapReduce.
RDDs support in-memory processing. Accessing data from memory is commonly cited as 10 to 100 times faster than accessing data over a network or from disk, although the actual gain depends on the workload and hardware. By contrast, Hadoop’s MapReduce-based processing often has to read data from disk.
In addition to performance gains, working through an abstraction layer provides a convenient and consistent way for developers and engineers to work with a variety of data sets.
When to use Spark RDDs?
Use RDDs when you need to compute a result from a dataset through Spark Actions such as reduce, for example when answering questions such as “how many times did xyz happen?” or “how many times did xyz happen by location?”
Often, RDDs are transformed into new RDDs in order to better prepare datasets for processing further down the pipeline. To reuse a previous example, let’s say you want to examine North America customer data and you have an RDD of all worldwide customers in memory. It could be beneficial from a performance perspective to create a new RDD of only the North America customers instead of using the much larger RDD of all worldwide customers.
Depending on the Spark operating environment and RDD size, RDDs should be cached (via the cache function) or persisted to disk (via the persist function) when the RDD is expected to be used more than once.
Featured Image credit https://flic.kr/p/7TqgUV