Icone fr 02

Spark and Machine Learning (MLlib)

Author: Antonio Berti

Translator: Sabrina Sala


In this tutorial we are going to describe the use of Apache Foundation’s library for Machine Learning: the so-called MLlib.

MLib is one of Spark’s API and it is interoperable with Python NumPy as well as R libraries. If it is developed with Spark, it is possible to use any type of Hadoop data source of Hadoop platform, e.g., HDFS, HBase, data sources coming from relational databases or local data sources such as text files.

Spark execels at interative computation, enabling MLlib to run fast and also allowing companies to use it in their business activity.

MLlib provides different types of algorithm, along with many utility functions. ML also includes classification algorithms, regression ones, decision trees, recommendation and clustering algorithms. Among the most popular utilities we may include trsformation, standardization and normalization, as well as statistical and linear algebra’s functions.

With the following code we wuold like to explain how to develop a simple logistic regression model using MLlib.


First of all we load the dataset.

val data =MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

The dataset is then divided in two parts, one used for training the model (60%) and the other for the test (40%).

val splits = data.randomSplit(Array(0.6, 0.4), seed=11L)

val training = splits(0).cache()

val test = splits(1)

Next, we train the algorithm and build the model.

val model = new LogisticRegressionWithLBFGS()



The model is run on the test dataset.

val predictionAndLabels = test.map {case LabeledPoint(label, features) =>

 val prediction = model.predict(features)

  (prediction, label)}

By doing so, we can collect model metric and the predictive accurancy.

val metrics = new MulticlassMetrics(predictionAndLabels)

val accuracy = metrics.accuracy

println("Accuracy = $accuracy")

It is now possible to store the model after training fase is over, so that we can recall it as and when required.

model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")

val sameModel = LogisticRegressionModel.load(sc,




Spark- http://spark.apache.org

MLlib – http://spark.apache.org/docs/latest/ml-guide.html

Data examples – https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

Sample dataset of MLlib – https://github.com/apache/spark/tree/master/data/mllib