# Course: Machine Learning

## Teacher

## What is Machine Learning?

- Arthur Samuel (1959)
  - Machine learning: «Field of study that gives computers the ability to learn without being explicitly programmed»
  - His checkers-playing program worked out which board positions were good and bad depending on wins/losses.
- Tom Mitchell (1998)
  - Well-posed learning problem: «A computer program is said to learn from **experience E** with respect to some class of **tasks T** and **performance measure P**, if its performance at tasks in T, as measured by P, improves with experience E.»
  - The checkers example:
    - E = tens of thousands of games played
    - T = playing checkers
    - P = whether you win the game or not
### Several types of learning algorithms

- Supervised learning
  - Teach the computer how to do something, then let it use its new-found knowledge to do it
- Unsupervised learning
  - Let the computer learn how to do something, and use this to determine structure and patterns in data
- Reinforcement learning
- Recommender systems

## Supervised learning

Supervised learning: we are given the «right answer» for each example in the data, and the task is to learn the mapping from inputs to those answers.

### Regression

Predict real-valued output

### Classification

Predict discrete-valued output
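To make the distinction concrete, here is a minimal sketch (the housing data is made up for illustration, not from the course) showing that the two tasks differ only in the kind of target (y):

```python
# Toy illustration (made-up housing data): the same inputs x can feed
# either task; only the kind of target y changes.

sizes = [1.0, 1.5, 2.0, 2.5]           # x: house size in 1000s of sq ft

prices = [150.0, 200.0, 260.0, 310.0]  # regression target: real-valued
labels = [0, 0, 1, 1]                  # classification target: discrete classes

assert all(isinstance(y, float) for y in prices)  # continuous outputs
assert all(y in (0, 1) for y in labels)           # discrete outputs
```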

## Unsupervised learning

Unsupervised learning: we are given data without labels and let the algorithm find structure in it on its own.

### Clustering

#### Examples of clustering

- News classification (Google News)
- Market segmentation
- Social network analysis

### Cocktail party problem algorithm

## Linear regression with one variable

### Model representation of Linear regression with one variable

- Training set
- Learning Algorithm
- (m) - number of training examples
- (x) - input variable / features
- (y) - output variable / target variable
- ((x,y)) - one training example
- (h(x)) - hypothesis function / mapping function from input to target

**Training Set** fed in -> **Learning Algorithm** outputs -> **Hypothesis**

Hypothesis: (h_{\theta}(x) = \theta_{0} + \theta_{1}x)
Parameters: (\theta_{0}, \theta_{1})

#### How to choose the (\theta_{i}) parameters?

Choose (\theta_{0}), (\theta_{1}) so that (h_{\theta}(x)) is close to (y) for our training examples ((x,y)).

##### Solve minimization problem

Our definition of ‘cost’: the average of the squared errors (with a convenient factor of ½), often called the Mean Squared Error.

Cost function: (\mathcal{J}(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum\limits_{i = 1}^m [ h_{\theta}(x_i) - y_i ]^2)

Minimize the cost function over (\theta_{0},\theta_{1}).
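A minimal sketch of this cost function in Python, with (h_{\theta}(x) = \theta_{0} + \theta_{1}x); the data values are made up for illustration:

```python
# Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2),
# with hypothesis h(x) = theta0 + theta1 * x.

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]  # lies exactly on y = x

print(cost(0.0, 1.0, xs, ys))  # perfect fit h(x) = x, so the cost is 0.0
print(cost(0.0, 0.5, xs, ys))  # worse fit, so the cost is larger
```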

## Gradient Descent algorithm

Used to minimize the cost function (J).

Gradient descent is used all over machine learning, and can minimize other functions too.

Start by looking at a general function (J).

Problem:

- Have some function (\mathcal{J}(\theta_{0},\theta_{1}))
- Want (\min \mathcal{J}(\theta_{0},\theta_{1}))

Gradient descent applies to more general functions:

- (\mathcal{J}(\theta_{0},\theta_{1}, \ldots, \theta_{n}))
- (\min \mathcal{J}(\theta_{0},\theta_{1}, \ldots, \theta_{n}))

### How does it work?

- Start with some initial (\theta_{0},\theta_{1}) (e.g. 0, 0)
- Keep changing (\theta_{0},\theta_{1}) to reduce (\mathcal{J}(\theta_{0},\theta_{1})) until you end up at a minimum

The start value corresponds to some point on the cost surface.

Where you start gradient descent determines which local optimum you end up in.

### Formal definition

Repeat until convergence:

(\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}} \mathcal{J}(\theta_{0},\theta_{1})), for (j = 0) and (j = 1)

- (:=) - assign the right-hand side to the left-hand side
- (\alpha) - the learning rate, which controls how big a step we take
- (\frac{\partial}{\partial\theta_{j}} \mathcal{J}(\theta_{0},\theta_{1})) - the partial derivative of the cost function

Update (\theta_{j}) by setting it to (\theta_{j}) minus (\alpha) times the partial derivative of the cost function with respect to (\theta_{j}).

There is a subtlety about the implementation: we need to make a simultaneous update.

(temp0 := \theta_{0} - \alpha\frac{\partial}{\partial\theta_{0}} \mathcal{J}(\theta_{0},\theta_{1}))

(temp1 := \theta_{1} - \alpha\frac{\partial}{\partial\theta_{1}} \mathcal{J}(\theta_{0},\theta_{1}))

(\theta_{0} := temp0)
(\theta_{1} := temp1)
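The simultaneous-update subtlety can be sketched as follows; the gradient functions here are hypothetical stand-ins (not the real cost gradients), chosen only to expose the ordering bug:

```python
# Sketch of one gradient descent step, contrasting the correct simultaneous
# update with a buggy sequential one.

def step(theta0, theta1, grad0, grad1, alpha):
    # Correct: evaluate BOTH gradients at the old (theta0, theta1), then assign.
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1

def buggy_step(theta0, theta1, grad0, grad1, alpha):
    # Wrong: theta0 has already been overwritten when grad1 is evaluated.
    theta0 = theta0 - alpha * grad0(theta0, theta1)
    theta1 = theta1 - alpha * grad1(theta0, theta1)  # sees the NEW theta0
    return theta0, theta1

g0 = lambda t0, t1: 2 * t0 + t1  # made-up gradient in both variables
g1 = lambda t0, t1: t0           # made-up gradient depending on theta0

print(step(1.0, 1.0, g0, g1, 0.1))        # approximately (0.7, 0.9)
print(buggy_step(1.0, 1.0, g0, g1, 0.1))  # approximately (0.7, 0.93): differs
```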

### Understanding the algorithm

To understand gradient descent, we’ll return to a simpler function where we minimize one parameter to help explain the algorithm in more detail

- (\min\ \mathcal{J}(\theta_{1})) where (\theta_{1}) is a real number
- Two key terms in the algorithm:
  - Alpha (the learning rate)
  - The derivative term
- Notation nuances:
  - Partial derivative vs. derivative
    - Use a partial derivative when we have multiple variables but differentiate with respect to only one
    - Use a derivative when we differentiate with respect to all the variables

(\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}}\mathcal{J}(\theta_1))

What the derivative (\frac{d}{d\theta_{1}}\mathcal{J}(\theta_1)) says:

- Take the tangent at the current point and look at the slope of the line
- If the line has a positive slope, the derivative is positive ((\frac{d}{d\theta_{1}}\mathcal{J}(\theta_1) \geq 0))
- Moving towards the minimum, (\theta_{1} := \theta_{1} - \alpha \cdot (some\ positive\ number)) gives a smaller value
- A negative slope results in bigger values of (\theta_{1}) as we move towards the minimum
- What happens if (\alpha) is too small or too large?
  - Too small: convergence takes too long
  - Too large: can overshoot the minimum and fail to converge
- When you get to a local minimum:
  - The gradient of the tangent (the derivative) is 0
  - (\theta_1) remains the same
- As we approach the minimum, the derivative term gets smaller, so the magnitude of the update to (\theta_1) gets smaller.
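The effect of (\alpha) can be sketched on the assumed toy cost (\mathcal{J}(\theta) = \theta^2) (not from the notes), whose derivative is (2\theta):

```python
# Learning-rate effect on J(theta) = theta^2, derivative 2*theta.
# Each update is theta := theta - alpha * 2 * theta, i.e. theta is
# multiplied by (1 - 2*alpha) at every step.

def run(alpha, steps, theta=1.0):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # gradient descent step on theta^2
    return theta

print(abs(run(0.1, 100)))  # small alpha: slowly shrinks towards the minimum at 0
print(abs(run(1.1, 10)))   # too-large alpha: overshoots and the magnitude grows
```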

## Linear regression with gradient descent

Apply gradient descent to minimize the squared-error cost function (\mathcal{J}(\theta_{0},\theta_{1})).

(\frac{\partial}{\partial\theta_{j}} \mathcal{J}(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{j}} \frac{1}{2m} \sum\limits_{i = 1}^m [ h_{\theta}(x_i) - y_i ]^2 = \frac{\partial}{\partial\theta_{j}} \frac{1}{2m} \sum\limits_{i = 1}^m [ \theta_0 + \theta_1 x_i - y_i ]^2)

To determine the derivative for each parameter you need multivariate calculus:

(j = 0:\ \frac{\partial}{\partial\theta_{0}} \mathcal{J}(\theta_{0},\theta_{1}) = \frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}(x_i) - y_i ])

(j = 1:\ \frac{\partial}{\partial\theta_{1}} \mathcal{J}(\theta_{0},\theta_{1}) = \frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}(x_i) - y_i ]\, x_i)

Plug these back into the gradient descent algorithm:

(\theta_{0} := \theta_{0} - \alpha\frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}(x_i) - y_i ])

(\theta_{1} := \theta_{1} - \alpha\frac{1}{m} \sum\limits_{i = 1}^m [ h_{\theta}(x_i) - y_i ]\, x_i)

The linear regression cost function is always a convex function - it always has a single minimum.
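The two update rules above can be sketched as a small batch gradient descent loop; the data set is made up so that the true line (y = 2x + 1) is known in advance:

```python
# A minimal batch gradient descent for one-variable linear regression.

def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        # Each step sums over all m training examples: "batch" gradient descent.
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1  # simultaneous update
    return t0, t1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # converges to approximately 1.0 2.0
```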

- This is actually **Batch Gradient Descent**
  - Refers to the fact that each step looks at all the training data
  - Each step computes over all m training examples
  - Non-batch versions exist that look at small subsets of the data
  - We'll look at other forms of gradient descent (to use when m is too large) later in the course
- There also exists a numerical way to solve for the minimum directly: the **normal equations** method
  - Gradient descent scales better to large data sets though
  - Gradient descent is used in lots of contexts in machine learning

## Normal equation method

- To solve the minimization problem (\min\ \mathcal{J}(\theta_{0},\theta_{1})) exactly, we can use a numerical method which avoids the iterative approach used by gradient descent.
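For the one-variable model, the normal equations reduce to the familiar closed-form least-squares formulas; a sketch with made-up data:

```python
# Exact (non-iterative) least-squares fit for h(x) = theta0 + theta1 * x:
# theta1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2),  theta0 = ȳ - theta1 * x̄.

def fit_exact(xs, ys):
    m = len(xs)
    x_mean, y_mean = sum(xs) / m, sum(ys) / m
    t1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    t0 = y_mean - t1 * x_mean
    return t0, t1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
print(fit_exact(xs, ys))   # -> (1.0, 2.0), recovered with no iterations
```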

## Learning with a larger number of features

- With multiple features it becomes hard to plot
- Notation becomes more complicated too
- We need linear algebra for more complex linear regression models