PySpark DataFrames by Example

What are PySpark DataFrames?

A PySpark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external […]

How to Deploy Python Programs to a Spark Cluster

Python Program Deploy to Spark Cluster

After you have a Spark cluster running, how do you deploy Python programs to it? It’s not as straightforward as you might think or hope, so let’s explore further in this PySpark tutorial. PySpark Application Deploy Overview: Let’s deploy a couple of example PySpark programs to our cluster. Let’s start with […]

PySpark Quick Start

Apache Spark Python Tutorial

In this post, let’s cover the fundamentals of Apache Spark with Python to get you started and feeling comfortable using PySpark. The intention is for readers to understand basic PySpark concepts through examples. Later posts will dive deeper into Apache Spark fundamentals and example use cases. Spark computations can be called via Scala, Python or Java. There […]

Connect ipython notebook to Apache Spark Cluster

ipython spark

This post will quickly cover how to connect an ipython notebook to two kinds of Spark clusters: a Spark cluster running in Standalone mode and a Spark cluster running on Amazon EC2. What is ipython? IPython Notebook is an interactive computing environment that enables users to create and share documents with live code, equations, visuals, and text. […]

PySpark Action Examples

Apache Spark Action Examples in Python

PySpark action functions return a computed value to the Spark driver program. This is different from PySpark transformation functions, which produce RDDs, DataFrames or DataSets as results. For example, an action function such as count or collect will return a result to the Spark driver, while a transformation function such as filter will not. These may seem […]

PySpark Transformations in Python Examples

Spark Transformations with Python Examples

If you’ve read the previous PySpark tutorials on this site, you know that Spark transformation functions produce a DataFrame, DataSet or Resilient Distributed Dataset (RDD). Resilient distributed datasets are Spark’s main programming abstraction, and RDDs are automatically parallelized across the cluster. As Spark matured, this abstraction changed from RDDs to DataFrames to DataSets, but the […]

Apache Spark and ipython notebook – The Easy Way


Using ipython notebook with Apache Spark couldn’t be easier. This post will cover how to use ipython notebook (jupyter) with Spark and why it is the best choice when using Python with Spark. Requirements: This post assumes you have downloaded and extracted Apache Spark and you are running on a Mac or *nix. If you are […]