Course: Machine Learning - Andrew Ng @ Coursera


What is Machine Learning?

  • Arthur Samuel (1959)
    • Machine learning: «Field of study that gives computers the ability to learn without being explicitly programmed»
    • His checkers-playing program worked out which board positions were good or bad depending on wins/losses.
  • Tom Mitchell (1998)
    • Well posed learning problem: «A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.»
    • The checkers example
      • E = 10,000s of games played
      • T = playing checkers
      • P = whether you win or not

Several types of learning algorithms

  • Supervised learning
    • Teach the computer how to do something, then let it use its new-found knowledge to do it
  • Unsupervised learning
    • Let the computer learn how to do something, and use this to determine structure and patterns in data
  • Reinforcement learning
  • Recommender systems

Supervised learning

We are given a data set where the «right answer» is known for each example.


Regression: predict real-valued output


Classification: predict discrete-valued output

Unsupervised learning

The data comes with no labels; the algorithm has to find structure and patterns in the data on its own.


Examples of clustering

  • News classification (Google News)
  • Market segmentation
  • Social network analysis

Cocktail party problem algorithm

Linear regression with one variable

Model representation of Linear regression with one variable

  • Training set
  • Learning Algorithm
  • (m) - number of training examples
  • (x) - input variable / features
  • (y) - output variable / target variable
  • ((x,y)) - one training example
  • (h(x)) - hypothesis function / mapping function from input to target

The training set is fed into the learning algorithm, which outputs a hypothesis.

Hypothesis: (h_{\theta}(x) = \theta_{0} + \theta_{1}x)
Parameters: (\theta_{0}, \theta_{1})

How to choose the parameters (\theta_{i})?

Choose (\theta_{0}), (\theta_{1}) so that (h_{\theta}(x)) is close to (y) for our training examples ((x,y)).

Solve minimization problem

Our definition of ‘cost’ is the average of the squared errors (halved for mathematical convenience), often called the Mean Squared Error.
Cost function: (\mathcal{J}(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum\limits_{i = 1}^m [ h_{\theta}( x_i ) - y_i ]^2 )
Minimize the cost function over (\theta_{0},\theta_{1}).
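The cost function can be sketched directly in Python (a minimal illustration; the function name and the toy data are my own, not from the course):

```python
# Mean Squared Error cost for univariate linear regression:
# J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2), with h(x) = theta0 + theta1*x.

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A perfect fit gives zero cost: the data below lies exactly on y = 2x.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cost(0.0, 2.0, xs, ys))  # 0.0
```

With a worse guess such as (\theta_{0}=0, \theta_{1}=0), every prediction is 0, so the cost is large; minimizing (J) means driving these squared errors down.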

Gradient Descent algorithm

Used to minimize the cost function (J).
Gradient descent is used all over machine learning, and to minimize other functions as well.

Start by looking at general function (J)


  • Have some function (\mathcal{J}(\theta_{0},\theta_{1}))
  • Want (\min\ \mathcal{J}(\theta_{0},\theta_{1}))

Gradient Descent applies to more general functions

  • (\mathcal{J}(\theta_{0},\theta_{1}, \ldots, \theta_{n}))
  • (\min\ \mathcal{J}(\theta_{0},\theta_{1}, \ldots, \theta_{n}))

How does it work?

  • Start with some (\theta_{0},\theta_{1}) (e.g. both set to 0)
  • Keep changing (\theta_{0},\theta_{1}) to reduce (\mathcal{J}(\theta_{0},\theta_{1})) until we end up at a minimum.

The start value corresponds to some point on the surface.
Where you start gradient descent determines which local optimum you end up in.

Gradient Descent

Formal definition

Repeat until convergence
(\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}} \mathcal{J}(\theta_{0},\theta_{1})), (for j = 0 and j = 1)

  • (:=) - assign the right-hand side to the left-hand side
  • (\alpha) - learning rate, controls how big a step we take
  • (\frac{\partial}{\partial\theta_{j}} \mathcal{J}(\theta_{0},\theta_{1})) - partial derivative of the cost function

Update (\theta_{j}) by subtracting from it (\alpha) times the partial derivative of the cost function with respect to (\theta_{j}).

There is a subtlety about the implementation: we need to make a simultaneous update
(temp0 := \theta_{0} - \alpha\frac{\partial}{\partial\theta_{0}} \mathcal{J}(\theta_{0},\theta_{1}))
(temp1 := \theta_{1} - \alpha\frac{\partial}{\partial\theta_{1}} \mathcal{J}(\theta_{0},\theta_{1}))
(\theta_{0} := temp0)
(\theta_{1} := temp1)
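The simultaneous update can be sketched as follows (the names `gd_step`, `grad0`, `grad1` are illustrative, not from the course):

```python
# One gradient-descent step with the required simultaneous update.
# grad0 and grad1 stand for the partial derivatives of J.

def gd_step(theta0, theta1, grad0, grad1, alpha):
    # Both temporaries are computed from the OLD parameter values
    # before either theta is overwritten -- the subtlety noted above.
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1

# Toy example on J(t0, t1) = t0^2 + t1^2, whose partials are 2*t0 and 2*t1:
t0, t1 = gd_step(1.0, 2.0, lambda a, b: 2 * a, lambda a, b: 2 * b, 0.1)
```

Updating (\theta_{0}) in place and then using the new value when computing (\theta_{1}) would be a different (incorrect) algorithm.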

Understanding the algorithm

To understand gradient descent, we’ll return to a simpler function where we minimize one parameter to help explain the algorithm in more detail

  • (\min\ \mathcal{J}(\theta_{1})) where (\theta_{1}) is a real number
  • Two key terms in the algorithm
    • Alpha
    • Derivative term
  • Notation nuances
    • Partial derivative vs. derivative
      • Use the partial derivative when we have multiple variables but derive with respect to only one
      • Use the ordinary derivative when the function has only one variable

(\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}}\mathcal{J}(\theta_1))

What the derivative term (\frac{d}{d\theta_{1}}\mathcal{J}(\theta_1)) says:

  • Let's take the tangent at the point and look at the slope of the line
  • If the line has a positive slope, the derivative is positive ((\frac{d}{d\theta_{1}}\mathcal{J}(\theta_1) \geq 0)), so (\theta_{1} := \theta_{1} - \alpha \cdot (some\ positive\ number)) moves (\theta_{1}) to a smaller value, towards the minimum
  • A negative slope makes (\theta_{1}) bigger, again moving towards the minimum
  • What happens if (\alpha) is too small or too large
    • Too small
      • Convergence takes too long (many tiny steps)
    • Too large
      • Can overshoot the minimum and fail to converge
  • When you get to a local minimum
    • Gradient of tangent/derivative is 0
    • (\theta_1) remains the same
  • As we approach the global minimum, the derivative term gets smaller, so the magnitude of the update to (\theta_1) gets smaller.
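The effect of (\alpha) can be seen numerically on a toy function of my own choosing, (J(\theta) = \theta^2), whose derivative is (2\theta):

```python
# Gradient descent on J(theta) = theta^2 (minimum at theta = 0).

def descend(theta, alpha, steps):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta := theta - alpha * dJ/dtheta
    return theta

# Reasonable alpha: theta shrinks towards the minimum at 0.
near_zero = descend(1.0, 0.1, 50)
# Too-large alpha: each step overshoots and theta grows without bound.
diverged = descend(1.0, 1.1, 50)
```

With (\alpha = 0.1) the iterate ends up very close to 0; with (\alpha = 1.1) each step lands farther from the minimum than the last, illustrating the failure to converge.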

Linear regression with gradient descent

  • Apply gradient descent to minimize the squared-error cost function (\mathcal{J}(\theta_{0},\theta_{1}))
    (\frac{\partial}{\partial\theta_{j}} \mathcal{J}(\theta_{0},\theta_{1}) =
    \frac{\partial}{\partial\theta_{j}} \frac{1}{2m} \sum\limits_{i = 1}^m [ h_{\theta}( x_i ) - y_i ]^2 =
    \frac{\partial}{\partial\theta_{j}} \frac{1}{2m} \sum\limits_{i = 1}^m [ \theta_0 + \theta_1 x_i - y_i ]^2 )

  • Determine the derivative for each parameter (this requires multivariate calculus)
    (j = 0: \frac{\partial}{\partial\theta_{0}} \mathcal{J}(\theta_{0},\theta_{1}) =
    \frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}( x_i ) - y_i ] )
    (j = 1: \frac{\partial}{\partial\theta_{1}} \mathcal{J}(\theta_{0},\theta_{1}) =
    \frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}( x_i ) - y_i ]\, x_i)

  • Plug back into the gradient descent algorithm
    (\theta_{0} := \theta_{0} - \alpha\frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}( x_i ) - y_i ] )
    (\theta_{1} := \theta_{1} - \alpha\frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}( x_i ) - y_i ]\, x_i )
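The update rules above can be sketched in Python (a minimal implementation; the toy data, (\alpha), and iteration count are illustrative choices of mine):

```python
# Gradient descent for univariate linear regression, h(x) = theta0 + theta1*x.
# Each iteration uses ALL m training examples to compute the gradients.

def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # (1/m) sum(h(x_i) - y_i)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # (1/m) sum((h(x_i) - y_i) * x_i)
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Data generated from y = 1 + 2x should recover theta0 ≈ 1, theta1 ≈ 2:
theta0, theta1 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Because the data lies exactly on a line, the recovered parameters converge to the true intercept and slope.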

The linear regression cost function is always a convex function - it always has a single minimum.

  • This is actually Batch Gradient Descent
    • Refers to the fact that over each step you look at all the training data
      • Each step compute over m training examples
    • Sometimes non-batch versions exist, which look at small data subsets
      • We’ll look at other forms of gradient descent (to use when m is too large) later in the course
  • There also exists a closed-form solution for finding the minimum of the cost function
    • Normal equations method
    • Gradient descent scales better to large data sets though
    • Used in lots of contexts and machine learning

Normal equation method

  • The minimization problem (\min\ \mathcal{J}(\theta_{0},\theta_{1})) can also be solved exactly with a closed-form method, avoiding the iterative approach used by gradient descent.
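For one variable, the normal equation (\theta = (X^T X)^{-1} X^T y) reduces to a 2×2 linear system that can be solved by hand (a sketch with illustrative data; the course derives the general method later):

```python
# Closed-form least-squares fit for h(x) = theta0 + theta1*x.
# With a bias column of ones, X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy],
# so the 2x2 system is solved directly via Cramer's rule.

def normal_equation(xs, ys):
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1

# Data from y = 1 + 2x is recovered exactly, with no iterations or alpha:
print(normal_equation([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (1.0, 2.0)
```

Unlike gradient descent, there is no learning rate to tune and no iteration, but inverting (X^T X) becomes expensive when the number of features is large.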

Learn with a larger number of features

  • With multiple features it becomes hard to plot
  • The notation becomes more complicated too
  • We need linear algebra for more complex linear regression models
