TensorFlow 1. GradientDescentOptimizer Explanation on Linear Regression Model

To understand how TensorFlow works, it helps to learn a little of the theory behind some of its basic features. In my view, the one most worth understanding is GradientDescentOptimizer. Its core method is minimize, which you can see everywhere in TensorFlow code.

First, let’s check the official documentation for GradientDescentOptimizer.minimize:

https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer , quoted here:

Add operations to minimize loss by updating var_list

Ok, here is my understanding.

1. minimize adds an Operation (a node) to the TensorFlow graph. An Operation is a node in a TensorFlow Graph that takes zero or more Tensor objects as input, and produces zero or more Tensor objects as output.
2. When you run that Operation, it computes the gradients for you and then updates the variables found in the graph.

So why and how is the gradient computed, and why and how are the variables updated? I will illustrate with examples, step by step.

It is like a game of training a bad dog into a good dog: if the dog behaves well it gets a reward, otherwise a punishment.

In supervised machine learning, you compare the predicted result of each iteration with the correct result. If the prediction does not match the correct result, you need to adjust your training parameters. But how do you adjust them?

Let’s take a simple example, $y = x^2$. We want to decrease y by changing x. There are two questions.

1. First question: initially x=+1, so y=1. Now you want to decrease y a little by changing x. Do you make x larger or smaller?

It is so easy that a third grader can solve it: you just decrease x a little, for example to x=+0.99.

2. Second question: initially x=-1, so y=1. What is your answer to the same problem?

It is a piece of cake. The answer is to increase x a little, for example to x=-0.99.

Wait a minute before you celebrate, even if you answered both correctly.

1. How can you know that without calculating y again?
2. If the equation has many independent variables, it is hard to guess. For example, suppose y depends on three variables and we want to make y increase a little at the point (x1,x2,x3)=(1,2,3). How do you change x1, x2 and x3? In deep learning, you may have thousands of parameters!

That is the gradient’s magic. We can use the gradient to see whether to increase or decrease the parameter x to make y smaller. Here is how we do that for $y = x^2$ in math.

1. First, you need to know the gradient’s magic: according to calculus, if you take a small step in the direction opposite to the gradient, you move in the direction in which the value decreases fastest.

This magic is well known as gradient descent. According to Wikipedia, the gradient is a multi-variable generalization of the derivative. Here ($y = x^2$) there is only one variable, so all we need is the derivative.

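According to the definition of the derivative, $f'(x) = \lim_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x}$, for a sufficiently small step $\Delta x$ we get:

1. $f(x+\Delta x) - f(x) < 0$ when $f'(x) > 0$ and $\Delta x < 0$;
2. $f(x+\Delta x) - f(x) < 0$ when $f'(x) < 0$ and $\Delta x > 0$;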

We can describe the above inequalities in words as follows:

If we update x by a small step in the direction opposite to $f'(x)$, we can always decrease $f(x)$. Let’s see two simple examples: f(x)=x and f(x)=-x.

We can graph them as below.

[Figure: graphs of f(x)=x (left) and f(x)=-x (right)]

1. f(x)=x: $f'(x) = 1$. So at each point x=x0, if you want to decrease y0 a little, you need to update x0 by a small negative amount. That is, $x_0 \leftarrow x_0 - \eta f'(x_0)$, where $\eta > 0$ is a small step size.
2. f(x)=-x: $f'(x) = -1$. So at each point x=x0, if you want to decrease y0 a little, you need to update x0 by a small positive amount. That is, $x_0 \leftarrow x_0 - \eta f'(x_0)$.

The update formula for x0 is the same in both cases.

That is the math theory behind the magic. It also holds for multi-variable equations.
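The rule can be tried out numerically. The snippet below is only a sketch: the helper name gradient_step, the step size 0.01 and the starting point 3.0 are illustrative choices, and x^2 is assumed to be the one-variable example from the two questions (it gives y=1 at x=+1 and x=-1).

```python
def gradient_step(x, dfdx, learning_rate=0.01):
    """Move x a small step against the derivative dfdx evaluated at x."""
    return x - learning_rate * dfdx(x)

# (name, f, derivative of f, starting point)
cases = [
    ("f(x) = x",   lambda x: x,     lambda x: 1.0,     3.0),
    ("f(x) = -x",  lambda x: -x,    lambda x: -1.0,    3.0),
    ("f(x) = x^2", lambda x: x * x, lambda x: 2.0 * x, 1.0),
    ("f(x) = x^2", lambda x: x * x, lambda x: 2.0 * x, -1.0),
]

results = []
for name, f, dfdx, x0 in cases:
    x1 = gradient_step(x0, dfdx)
    results.append((name, x0, x1, f(x0), f(x1)))
    print("%s: x0=%s -> x1=%s, f decreased: %s" % (name, x0, x1, f(x1) < f(x0)))
```

In every case the same update formula is used, and the sign of the derivative alone decides whether x moves left or right.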

1. Multi-variable equation $z = x^2 + y^2 - 1$: ref to https://commons.wikimedia.org/wiki/File:Gradient_descent.svg

The above diagram is a contour plot of z.

Or, we can plot the surface of the z function as below.

From the equation, from the partial derivatives with respect to x and y, or from the surface diagram, we can easily see that the minimum of z is -1 at (x,y)=(0,0), because:

1. z is the sum of the squares of the two variables, minus 1, or
2. Setting the partial derivatives to zero gives (x,y)=(0,0), or
3. The bluest place is the flattest location, where the derivative is 0.

At the beginning, z0=f(x0,y0). The label x0 in the Wikimedia contour diagram represents the point (x0,y0)=(-0.6498,-1.0212). If we want to decrease z0, what is our next point x1=(x1,y1)?

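According to our formula (2), we can calculate the gradient direction $\nabla z$ at x0 as below (taking $z = x^2 + y^2 - 1$, the sum of two squares minus 1):

$\nabla z = \left(\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}\right) = (2x, 2y)$, so $\nabla z(x_0) = (-1.2996, -2.0424)$.

So, according to our magic, we should update x0 in the direction opposite to $\nabla z(x_0)$.

We can plot the contour, the point x0 and the gradient $\nabla z(x_0)$ with Desmos as below. You can see that the line x0x1 is parallel to $\nabla z(x_0)$, but points in the opposite direction.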
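The same step can be checked with a few lines of Python. This is a sketch under two assumptions: z is taken to be x^2 + y^2 - 1 (the sum of two squares minus 1 described above), and the step size 0.1 is an arbitrary illustrative value, not one taken from the diagram.

```python
def z(x, y):
    # Assumed form of z: sum of two squares minus 1
    return x * x + y * y - 1.0

def grad_z(x, y):
    # Partial derivatives of z with respect to x and y
    return (2.0 * x, 2.0 * y)

learning_rate = 0.1  # illustrative step size
x0 = (-0.6498, -1.0212)

g = grad_z(*x0)
# Step against the gradient to get the next point x1
x1 = (x0[0] - learning_rate * g[0], x0[1] - learning_rate * g[1])

print("gradient at x0:", g)
print("x1:", x1)
print("z(x0) =", z(*x0), " z(x1) =", z(*x1))
```

The displacement from x0 to x1 is a negative multiple of the gradient, which is why the line x0x1 is parallel to the gradient but points the other way.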

1. How TensorFlow GradientDescentOptimizer.minimize runs

The first part, “Gradient Matter”, explains the basic concept of gradient descent. That is the theory behind TensorFlow’s GradientDescentOptimizer.minimize.

Now we will use TensorFlow code to validate that GradientDescentOptimizer.minimize follows the theory.

1. The Linear Regression problem

The classic linear regression problem is to draw a line that fits n scattered points.

In our example, we generate the points randomly on the line y=2x+6 (of course, you are not supposed to know the line; the code below adds no noise to the points). You need to find the line. In math, that means finding the best-fitting parameters a and b in the line equation y=ax+b.

1. TensorFlow code to solve the problem (ref: https://learningtensorflow.com/lesson7/):

import tensorflow as tf
import numpy as np

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")

# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0], name="w")

# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1]

# Our error is defined as the square of the differences
error = tf.square(y - y_model)

# The Gradient Descent Optimizer does the heavy lifting (learning rate 0.01)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)

# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    for i in range(1000):
        # One random training sample on the line y = 2x + 6
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        print("before x,y,W:%s,%s,%s" % (x_value, y_value, session.run(w)))
        # One gradient descent step on this sample
        session.run(train_op, feed_dict={x: x_value, y: y_value})
        print("afterW:%s" % (session.run(w)))

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))

The printed result is as below:

before x,y,W:0.140416003299,6.2808320066,[1. 2.]

afterW:[1.0116276 2.0828083]

before x,y,W:0.652577668566,7.30515533713,[1.0116276 2.0828083]

afterW:[1.0711712 2.1740518]

before x,y,W:0.531842036294,7.06368407259,[1.0711712 2.1740518]

afterW:[1.1171217 2.2604506]

before x,y,W:0.748193761765,7.49638752353,[1.1171217 2.2604506]

afterW:[1.1829644 2.3484528]

before x,y,W:0.797366616535,7.59473323307,[1.1829644 2.3484528]

afterW:[1.2515862 2.4345133]

before x,y,W:0.458017287808,6.91603457562,[1.2515862 2.4345133]

afterW:[1.2873874 2.5126789]

….

before x,y,W:0.63061269596,7.26122539192,[2.2779834 5.8520336]

afterW:[2.2776387 5.851487 ]

before x,y,W:0.759848576593,7.51969715319,[2.2776387 5.851487 ]

afterW:[2.2766895 5.8502383]

before x,y,W:0.654521160378,7.30904232076,[2.2766895 5.8502383]

afterW:[2.2762792 5.8496118]

Predicted model: 2.276x + 5.850

1. Why can the TensorFlow code solve the problem?

Our model is y = a*x + b. Note that our parameters are a and b, not x.

We get one sample (x, y) at the start of each loop iteration. In the code, a and b are combined into the tensor variable w.

In the mathematical linear regression model, the best-fitting line for the sample data is found by choosing a and b to minimize the squared error between the predicted y and the sample y, denoted as $E(w) = \sum_i \left(y_i - (a x_i + b)\right)^2$ with $w = (a, b)$.

The TensorFlow code follows the mathematical linear regression model. It uses the samples one by one to decrease $E(w)$ by updating a and b with the gradient descent magic. As each iteration decreases the error on the current sample, $E(w)$ approaches its real minimum after hundreds of iterations. The w at that point holds the parameters a and b that we are looking for.

Let’s calculate manually to see whether the result matches what the TensorFlow code printed.

*1) First sample

We have the initial guess (a,b)=(1,2) and the first sample (x,y)=(0.140416003299,6.2808320066).

The data is from the first two lines of the printed result, copied here for quick reference:

before x,y,W:0.140416003299,6.2808320066,[1. 2.]

afterW:[1.0116276 2.0828083]

Recall the gradient descent magic: to decrease the value of E(w), we need to update the parameter values in the direction opposite to the gradient.
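The manual calculation for this sample can be sketched as below. It assumes the per-sample error $E = (y - (ax + b))^2$ used in the code and a learning rate of 0.01, a value inferred here because it reproduces the printed numbers. The symbols $r$ and $\eta$ are introduced only for this sketch.

$$
\begin{aligned}
r &= y - (ax + b) = 6.2808320 - (1 \times 0.1404160 + 2) = 4.1404160 \\
\frac{\partial E}{\partial a} &= -2xr = -2 \times 0.1404160 \times 4.1404160 \approx -1.1627613 \\
\frac{\partial E}{\partial b} &= -2r \approx -8.2808320 \\
a &\leftarrow a - \eta \frac{\partial E}{\partial a} = 1 + 0.01 \times 1.1627613 \approx 1.0116276 \\
b &\leftarrow b - \eta \frac{\partial E}{\partial b} = 2 + 0.01 \times 8.2808320 \approx 2.0828083
\end{aligned}
$$

Both values agree with the printed line afterW:[1.0116276 2.0828083].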

*2) Second sample

Let’s try another set of (x,y) and (a,b) and manually calculate the update of a and b.

before x,y,W:0.63061269596,7.26122539192,[2.2779834 5.8520336]

afterW:[2.2776387 5.851487 ]
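The same update can be replayed in plain Python. This is a sketch, not TensorFlow's internal code: the function name sgd_step is made up for illustration, and the learning rate 0.01 is assumed because it reproduces the printed numbers.

```python
def sgd_step(a, b, x, y, learning_rate=0.01):
    """One gradient descent update for the error (y - (a*x + b))**2."""
    residual = y - (a * x + b)
    grad_a = -2.0 * x * residual  # derivative of the error with respect to a
    grad_b = -2.0 * residual      # derivative of the error with respect to b
    return a - learning_rate * grad_a, b - learning_rate * grad_b

# Values copied from the "before" line of the second sample
a_new, b_new = sgd_step(2.2779834, 5.8520336, 0.63061269596, 7.26122539192)
print("afterW: [%.7f %.7f]" % (a_new, b_new))
```

Here the residual is negative (the model predicts slightly too high), so both a and b are nudged down. The result should agree with the printed afterW values up to float32 rounding.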

The updates of a and b in both examples match the results printed by TensorFlow.

1. Improve the accuracy via more iterations:

The printed model “2.276x + 5.850” does not match our data model “2x+6” very well. That is caused by the small number of iterations. If we change the iterations from 1000 to 3000 in the code (“for i in range(3000):”), we get a much more accurate result, as below:

Predicted model: 2.019x + 5.990
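The effect of the iteration count can also be reproduced without TensorFlow. The loop below is a plain-Python sketch of the same per-sample update. The function name fit_line, the fixed seed and the learning rate 0.01 are assumptions made for illustration, so the exact digits will differ from the TensorFlow run.

```python
import random

def fit_line(iterations, learning_rate=0.01, seed=0):
    """Per-sample gradient descent on (y - (a*x + b))**2 for points on y = 2x + 6."""
    rng = random.Random(seed)
    a, b = 1.0, 2.0  # same starting guesses as w in the TensorFlow code
    for _ in range(iterations):
        x = rng.random()
        y = 2.0 * x + 6.0
        residual = y - (a * x + b)
        # Step against the gradient of the squared error
        a += learning_rate * 2.0 * x * residual
        b += learning_rate * 2.0 * residual
    return a, b

for n in (1000, 3000):
    a, b = fit_line(n)
    print("%d iterations: %.3fx + %.3f" % (n, a, b))
```

With the small learning rate, a and b are still on their way after 1000 steps and land much closer to 2 and 6 after 3000.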