Skip to content
Ondřej Moravčík edited this page Mar 25, 2015 · 15 revisions

A statistical technique for estimating the relationships among variables.

Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variable) denoted X.

Least squares

data = [
  LabeledPoint.new(0.0, [0.0]),
  LabeledPoint.new(1.0, [1.0]),
  LabeledPoint.new(3.0, [2.0]),
  LabeledPoint.new(2.0, [3.0])
]
lrm = LinearRegressionWithSGD.train($sc.parallelize(data), initial_weights: [1.0])

lrm.intercept # => 0.0
lrm.weights   # => [0.9285714285714286]

lrm.predict([0.0]) < 0.5
# => true

lrm.predict([1.0]) - 1 < 0.5
# => true

lrm.predict(SparseVector.new(1, {0 => 1.0})) - 1 < 0.5
# => true

Lasso

An alternative regularized version of least squares

data = [
  LabeledPoint.new(0.0, [0.0]),
  LabeledPoint.new(1.0, [1.0]),
  LabeledPoint.new(3.0, [2.0]),
  LabeledPoint.new(2.0, [3.0])
]
lrm = LassoWithSGD.train($sc.parallelize(data), initial_weights: [1.0])

lrm.predict([0.0]) - 0 < 0.5
# => true

lrm.predict([1.0]) - 1 < 0.5
# => true

lrm.predict(SparseVector.new(1, {0 => 1.0})) - 1 < 0.5
# => true

Ridge

For non-linear least-squares problems.

data = [
    LabeledPoint.new(0.0, [0.0]),
    LabeledPoint.new(1.0, [1.0]),
    LabeledPoint.new(3.0, [2.0]),
    LabeledPoint.new(2.0, [3.0])
]
lrm = RidgeRegressionWithSGD.train($sc.parallelize(data), initial_weights: [1.0])

lrm.predict([0.0]) - 0 < 0.5
# => true

lrm.predict([1.0]) - 1 < 0.5
# => true

lrm.predict(SparseVector.new(1, {0 => 1.0})) - 1 < 0.5
# => true
Clone this wiki locally