
"This function will \"activate\" a neuron only if the input value is positive. Note that with this activation function, the activation level is not restricted to be between 0 and 1. Advantages of ReLU are that \n",

"- they are cheap to compute (later on, we are going to use millions of these units, so we need to take that into account)\n",

"- although its derivative is not continuous, it has nice properties for optimization purposes (the gradient does not vanish for large values of $x$; more on that later)"
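As a quick sketch of these two points (the helper names here are ours, not from the text), ReLU and its derivative are each a single elementwise comparison:

```python
import numpy as np

def relu(x):
    # ReLU activates only for positive inputs; the output is unbounded above
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0 (undefined at 0; 0 by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs mapped to 0, positive inputs kept
print(relu_grad(x))  # gradient is exactly 1 for all positive inputs, however large
```

The flat gradient of 1 on the positive side is what keeps the gradient from vanishing for large $x$, in contrast to the sigmoid.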


"where $\\mathbf y_m$ is the true output of the $m^{th}$ sample and $\\mathbf {\\hat y}_m$ is our estimate of the output for that sample. The sum spans the entire training set of size $M$. Our task is to find the values of the parameters that minimize this cost function.\n",

"\n",

"For an activation function $\\sigma$, the cost function for an individual input is written as\n",

"The parameter $\\lambda$ is called the **learning rate**.\n",

"\n",


"So in the limit where linearity holds we can compute the small increments in the weights and biases that ensure that the cost function will decrease. This method is called **gradient descent**.\n",
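A minimal sketch of this idea on a toy one-parameter cost (the cost $C(w) = (w-3)^2$ and the function names are illustrative, not from the text): at each step we move the parameter a small amount against the gradient, scaled by the learning rate $\lambda$.

```python
def grad_descent(w0, lr, n_steps):
    # Gradient descent on the toy cost C(w) = (w - 3)**2, whose gradient is 2*(w - 3)
    w = w0
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)  # dC/dw at the current point
        w -= lr * grad          # small step against the gradient
    return w

w_final = grad_descent(w0=0.0, lr=0.1, n_steps=100)
print(w_final)  # converges towards 3.0, the minimum of the cost
```

Each step decreases the cost because the update is proportional to minus the gradient, which is exactly the linear-limit argument above.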

"\n",

"1d"

]

},

{

...

...

@@ -932,9 +931,9 @@

"> ***Question***\n",

">\n",

"> - What is the problem if $\\lambda$ is too small? too big?\n",


"> - What happens if the cost function is a complicated function of $\\mathbf w$ with many local minima?\n",

"\n",


"In practice, the gradient descent method works well but is very slow to converge. There are other methods with better convergence properties for this iterative process: [Newton-Raphson](https://en.wikipedia.org/wiki/Newton%27s_method), [Conjugate gradient](https://en.wikipedia.org/wiki/Conjugate_gradient_method), etc. (see tutorial)."
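To build intuition for the question about $\lambda$ above, here is a toy experiment (the cost $C(w) = w^2$ and the function name are ours, not from the text) comparing three learning rates on the same number of steps:

```python
def gd_trajectory(lr, n_steps=20, w0=10.0):
    # Gradient descent on the toy cost C(w) = w**2, whose gradient is 2*w
    w = w0
    for _ in range(n_steps):
        w -= lr * 2.0 * w
    return w

print(gd_trajectory(0.001))  # too small: after 20 steps, barely moved from 10
print(gd_trajectory(0.4))    # well chosen: very close to the minimum at 0
print(gd_trajectory(1.1))    # too large: |w| grows at each step, the iteration diverges
```

A too-small $\lambda$ makes convergence painfully slow; a too-large one makes each step overshoot the minimum so badly that the iterates diverge.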

]

},

{

...

...

@@ -949,6 +948,18 @@

"### Hidden layers"

]

},

{

"cell_type": "markdown",

"id": "6c06983e",

"metadata": {

"slideshow": {

"slide_type": "subslide"

}

},

"source": [

"In the perceptron model, there is only a limited amount of complexity that you can model between the input and the output. This complexity is limited by the fact that two variables interact via the weighted sum and then via the sigmoid function. One way to overcome this limitation is to add one or more **hidden layers** of neurons between the input and output layers.\n"

]

},

{

"cell_type": "markdown",

"id": "f3be17ee",

...

...

@@ -958,7 +969,7 @@

}

},

"source": [


"The reason to add these layers is to break the problem down into multiple small tasks: for digit recognition, these could be \"pick an edge\" or \"find a straight line\".\n",
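A classic illustration of the extra expressiveness a hidden layer buys (the hand-picked weights below are ours, chosen for illustration, not from the text): a single perceptron cannot represent XOR, but a network with one hidden layer of two ReLU units can, with each hidden unit handling one sub-task.

```python
def relu(x):
    # ReLU activation: zero for negative inputs, identity for positive ones
    return max(0.0, x)

def xor_net(x1, x2):
    # Hand-picked weights; each hidden unit solves one small sub-task
    h1 = relu(x1 + x2)        # hidden unit 1: "at least one input is on"
    h2 = relu(x1 + x2 - 1.0)  # hidden unit 2: "both inputs are on"
    return h1 - 2.0 * h2      # output layer combines the sub-tasks

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))  # reproduces the XOR truth table
```

No choice of weights in a single-layer perceptron can produce this input/output map, which is exactly the limitation the hidden layer overcomes.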