
Spark's machine learning functionality is contained in Spark MLlib, Spark's library of machine learning (ML) functions designed to run in parallel on clusters. MLlib contains a variety of learning algorithms. The topic of machine learning itself could fill many books, so this chapter focuses on ML as it appears in Apache Spark.
Spark ML Overview
MLlib invokes various algorithms on RDDs. As an example, MLlib could be used to identify spam through the following:
1) Create an RDD of strings representing email messages.
2) Run one of MLlib’s feature extraction algorithms to convert text into an RDD of vectors.
3) Call a classification algorithm on the RDD of vectors to return a model object to classify new points.
4) Evaluate the model on a test dataset using one of MLlib’s evaluation functions.
Some classic ML algorithms are not included in Spark MLlib because they were not designed for parallel computation. Conversely, MLlib contains several recent research algorithms designed for clusters, such as distributed random forests, K-means||, and alternating least squares. MLlib is best suited to running machine learning algorithms on large datasets.
Spark Machine Learning Basics
Machine learning algorithms try to make predictions or decisions based on training data. There are multiple types of learning problems, including classification, regression, and clustering, each with a different objective.
All learning algorithms require defining a set of features for each item, which is then fed into the learning function. For example, for an email, the features might include the server it comes from, the number of mentions of the word "free", or the color of the text.
Pipelines often train multiple versions of a model and evaluate each one. To do this, separate the input data into “training” and “test” sets.
A very simple program for building a spam classifier in Python is shown below. The code and data files are available in the book's Git repository.
Data Types
MLlib contains a few specific data types, including Vector, LabeledPoint, Rating, and various Model classes.
Working with Vectors
Vectors come in two flavors: dense and sparse. Dense vectors store all their entries in an array of floating-point numbers, while sparse vectors store only the nonzero values and their indices. Sparse vectors are usually preferable when most entries are zero.
Algorithms
The key feature extraction and transformation utilities available in MLlib include TF-IDF, scaling, normalization, and Word2Vec.
Statistics
Various statistics functions are available.
Classification and Regression
Classification and regression are two common forms of supervised learning; the difference between the two is the type of variable predicted: a categorical class for classification, and a numerical value for regression.
Linear regression is one of the most common methods for regression. This method predicts the output variable as a linear combination of the features.
Logistic regression is a binary classification method that identifies a linear separating plane between positive and negative examples. MLlib also includes Support Vector Machines (SVMs), another linear method for the same binary classification task.
Naive Bayes is a multiclass classification algorithm that scores how well each point belongs in each class using a linear function of the features.
Decision trees are a flexible model used for both classification and regression.
Clustering
Clustering is an unsupervised learning task involving the grouping of objects into clusters of high similarity.
MLlib includes the popular K-means algorithm for clustering, as well as a variant called K-means||.
Collaborative Filtering and Recommendation
Collaborative filtering is a technique for recommender systems. Users’ ratings and interactions with various products are used to make recommendations.
MLlib includes Alternating Least Squares (ALS) which is a popular algorithm for collaborative filtering.
Dimensionality Reduction
The main technique for dimensionality reduction used by the machine learning community is principal component analysis (PCA), but MLlib also provides the lower-level singular value decomposition (SVD) primitive.
Model Evaluation
Many learning tasks can be addressed with a variety of different models, so it is important to evaluate candidate models, typically by measuring their performance on a held-out test set.
Tips and Performance Considerations
The following are worth considering when tuning performance: feature preparation, algorithm configuration, caching RDDs that are reused, recognizing sparsity, and the level of parallelism.
Pipeline API
Starting in Spark 1.2, MLlib is adding a new, higher-level API for machine learning based on the concept of pipelines. The pipeline API is similar to the one found in scikit-learn.
Featured image photo credit https://flic.kr/p/5tDCdT