PySpark Quick Start [Introduction to Apache Spark for Python Developers]

This PySpark quick start covers the fundamentals of Apache Spark with Python so you can get started and feel comfortable using PySpark. The intention is for readers to understand basic PySpark concepts through examples. Later posts will dive deeper into Apache Spark fundamentals and example use cases. Apache Spark is a distributed computing framework widely used …

PySpark DataFrames by Example

What are PySpark DataFrames?

A PySpark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external …

PySpark Examples of Actions

PySpark actions return a computed value to the Spark driver program. This is different from PySpark transformation functions, which produce RDDs, DataFrames, or Datasets as results. For example, an action function such as count will return a result to the Spark driver, while a transformation function such as filter will not. These may seem easy …