"Along with this small train data set, let's build a testing data set according to the same distribution. Note that we build a test data set is unrealistically big compared to the training data set. This is done on purpose to have converged validation metrics.\n",

"Along with this small training data set, let's build a testing data set according to the same distribution. Note that we build a testing data set that is unrealistically big compared to the training data set. This is done on purpose to have converged validation metrics.\n",

"\n",

"Below is a plot of the training data in blue and testing data in red."

]

...

...

@@ -243,7 +243,7 @@

"source": [

"### Early stopping\n",

"\n",

"Let's use a plain vanilla neural network with 1 hidden layer of 20 neurons, tanh activation function for the hidden layer and linear output units. The loss function is the mean squared error and we choose a tolerance of $10^{-7}$ (variation of the cost function between two trainings)."

"Let's use a classic fully connected neural network with 1 hidden layer of 20 neurons, a tanh activation function for the hidden layer and linear output units. The loss function is the mean squared error and we choose a tolerance of $10^{-7}$ (variation of the cost function between two training iterations)."

]

},
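As a concrete sketch (not necessarily the notebook's exact cell), such a network can be set up with scikit-learn's `MLPRegressor`; the synthetic data below is only a hypothetical stand-in for the notebook's data set:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical stand-in for the notebook's training data: a noisy 1D function.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(30, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=30)

# One hidden layer of 20 tanh neurons; the output is linear and the loss
# is the mean squared error by default. tol controls the stopping criterion
# on the variation of the loss.
model = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                     solver='lbfgs', tol=1e-7, max_iter=5000,
                     random_state=0)
model.fit(X_train, y_train)
```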

{

...

...

@@ -300,8 +300,8 @@

}

},

"source": [

"> - Compute the mean squared error of the prediction for the train and test data set.\n",

"> - Given all the information that you have on the way we built the data set, what value do you expect for the prediction error for the best model you can hope for?\n",

"> - Compute the mean squared error of the prediction for the training and testing data sets.\n",

"> - Given all the information that you have, what value do you expect for the prediction error for the best model you can hope for?\n",

"> - Comment on your results."

]

},
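A minimal sketch of the computation, again with stand-in synthetic data (the noise standard deviation of $0.1$ is an assumption; with it, the best achievable test MSE is about $0.01$):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Stand-in data with noise of standard deviation 0.1, so the best
# possible test MSE is roughly 0.1**2 = 0.01.
X_train = rng.uniform(-1, 1, size=(30, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=30)
X_test = rng.uniform(-1, 1, size=(2000, 1))
y_test = np.sin(3 * X_test[:, 0]) + 0.1 * rng.normal(size=2000)

model = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                     solver='lbfgs', max_iter=5000, random_state=0)
model.fit(X_train, y_train)

mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))
```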

...

...

@@ -333,7 +333,7 @@

"\n",

"> - Re-initialize the network with the same parameters as before\n",

"> - Write a loop and store the training and testing mean squared error at each optimization step.\n",

"> - Plot the mean squared error for both the training and testing data set."

"> - Plot the mean squared error as a function of the iteration number for both the training and testing data set."

]

},
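One way to implement such a loop with scikit-learn is to combine `warm_start=True` with `max_iter=1`, so that each call to `fit` performs a single optimization pass while keeping the current weights; the data below is a hypothetical stand-in:

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

warnings.filterwarnings("ignore")  # silence per-step ConvergenceWarning

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(30, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=30)
X_test = rng.uniform(-1, 1, size=(500, 1))
y_test = np.sin(3 * X_test[:, 0]) + 0.1 * rng.normal(size=500)

# warm_start=True keeps the weights between calls to fit;
# max_iter=1 makes each call perform one optimization pass.
model = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                     solver='adam', warm_start=True, max_iter=1,
                     random_state=0)

train_err, test_err = [], []
for _ in range(200):
    model.fit(X_train, y_train)   # one more step
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    test_err.append(mean_squared_error(y_test, model.predict(X_test)))
# train_err and test_err can then be plotted against the iteration number.
```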

{

...

...

@@ -565,7 +565,7 @@

}

},

"source": [

"We conclude that if we stop training after $n$ iterations, the component of $\\mathbf w$ which will be significantly different from zero are the component that are directed in the direction of the maximum variance of $\\mathbf H$. This result is similar to the ridge."

"We conclude that if we stop training after $n$ iterations, the components of $\\mathbf w$ which will be significantly different from zero are those directed along the directions of maximum variance of $\\mathbf H$. This result is similar to the ridge: the two methods are equivalent."

]

},

{

...

...

@@ -654,7 +654,7 @@

"source": [

"### Work with small networks\n",

"\n",

"With these two techniques, we reduce the size of the coefficients for some neurons. For our simple network that has only one single hidden layer, it means that some neurons will actually be completely useless. A strategy to get the same result would be to train a smaller network."

"With the ridge and LASSO techniques, we reduce the size of the coefficients for some neurons. For our simple network, which has a single hidden layer, it means that some neurons will actually be completely useless. A strategy to get the same result would be to train a smaller network."

]

},

{

...

...

@@ -668,7 +668,7 @@

"source": [

"Let's first analyze the magnitude of the weights in your best neural network with L2 regularization.\n",

"\n",

"> - Plot the magnitude of the weights that go in the hidden layer and the weights of the output layer.\n",

"> - Plot the magnitude of the weights of the hidden layer and the weights of the output layer.\n",

"> - How many coefficients are significantly different from 0? \n",

"> - This should give you an indication of how to build a lightweight network that has the same behavior as the regularized network.\n",

"> - Build this new network and comment on the results.\n"
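A possible sketch of the inspection, assuming an L2-regularized network (the synthetic data and the `alpha` value are illustrative only; `alpha` is scikit-learn's L2 penalty strength):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)

# alpha is the L2 regularization strength; 1.0 is only illustrative.
model = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                     solver='lbfgs', alpha=1.0, max_iter=5000,
                     random_state=0)
model.fit(X, y)

hidden_w = np.abs(model.coefs_[0]).ravel()   # input -> hidden weights
output_w = np.abs(model.coefs_[1]).ravel()   # hidden -> output weights
# Count the neurons whose output weight is significantly non-zero
# (the 1e-2 threshold is an arbitrary choice).
n_active = int(np.sum(output_w > 1e-2))
```

Plotting `hidden_w` and `output_w` (e.g. with `plt.stem`) then shows how many of the 20 neurons actually contribute.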

...

...

@@ -699,7 +699,7 @@

"source": [

"### Ensemble of networks\n",

"\n",

"Another strategy to regularize the solution is to train several network. If choose the different initial conditions for all networks, we are almost sure that each network will converge to a different solution. Each network may overfit the data but in a different way and so by averaging the output of all trained network, we can get a more generic (less overfitting) solution."

"Another strategy to regularize the solution is to train several networks. If we choose different initial conditions for all networks, we are almost sure that each network will converge to a different solution. Each network may overfit the data, but in a different way, and so by averaging the output of all trained networks, we can get a more generic (less overfitting) solution."

]

},
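A minimal sketch of the averaging, with stand-in data; only `random_state` differs between the ensemble members, so only the initial conditions change:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)
X_grid = np.linspace(-1, 1, 100).reshape(-1, 1)

# Train several identical networks with different random initializations.
models = [MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                       solver='lbfgs', max_iter=5000,
                       random_state=seed).fit(X, y)
          for seed in range(5)]

# The ensemble prediction is the average of the individual predictions.
y_mean = np.mean([m.predict(X_grid) for m in models], axis=0)
```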

{

...

...

@@ -715,7 +715,7 @@

"\n",

"Depending on the problem we are working on, the ensemble-of-networks solution may be too expensive. A similar approach is to temporarily alter the network by randomly deleting neurons. At each step of the gradient descent we remove a fraction of the neurons and rescale the remaining neurons so that the weighted sum stays on the same order of magnitude. \n",

"\n",

"Neuron dropout is a very popular approach and seem to really improve the regularization even if this approach does not rely on solid mathematical background. It is unfortunately not implemented in scikit-learn."

"Neuron dropout is a very popular approach and seems to really improve the regularization even though it does not rest on a solid mathematical foundation. It is unfortunately not implemented in scikit-learn (see [here](10_intro_deep_learning.ipynb#Towards-more-advanced-libraries) for libraries dedicated to neural networks)."

]

},
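Although scikit-learn does not provide dropout, its mechanics can be sketched in a few lines of NumPy. This is the so-called *inverted* dropout variant, where the surviving neurons are rescaled by $1/(1-p)$ so that the expected weighted sum is unchanged:

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero a fraction p_drop of the neurons and
    rescale the survivors so the expected weighted sum is unchanged."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(1000)                 # hidden-layer activations (illustrative)
h_dropped = dropout(h, 0.5, rng)  # ~half zeros, survivors scaled to 2
```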

{

...

...

@@ -729,7 +729,7 @@

"source": [

"### Increase sample size\n",

"\n",

"Of course, another obvious strategy is to train a model with enough samples. While this is not always possible to do so because we are limited by our observations, we can test the idea here.\n",

"Of course, another obvious strategy is to train a model with enough samples. While this is not always possible because we are limited by our observations, we can test the idea here for our synthetic data set.\n",

"\n",

"> - Increase the size of the training data set while keeping 20 neurons in the hidden layer. When the training data set is big enough, do you still overfit the data?"

]

...

...

@@ -755,7 +755,7 @@

"source": [

"## Deep networks\n",

"\n",

"The universal approximation theorem guarantees that we can approximate any smooth function with one single hidden layer provided that this layer holds enough neurons. At the same time, the idea of adding extra layer was driven by our intuition that a network would naturally segment the problem in task and subtask (for image recognition, that could be edge recognition, pattern recognition, etc.). This strategy is also inspired by the way we build computer softwares: we first build module to handle elementary operations and then build on top of these modules more elaborate functions. In the end we only use *high level* computer program with a cascade of layers of programs behind.\n"

"The universal approximation theorem guarantees that we can approximate any smooth function with one single hidden layer provided that this layer holds enough neurons. At the same time, the idea of adding extra layers was driven by our intuition that a network would naturally segment the problem into tasks and subtasks (for image recognition, that could be edge recognition, pattern recognition, etc.). This strategy is also inspired by the way we build computer software: we first build modules to handle elementary operations and then build more elaborate functions on top of these modules. In the end we only use *high level* computer programs with a cascade of layers of programs behind.\n"

]

},

{

...

...

@@ -783,16 +783,14 @@

"\n",

"Convolution is the main operation that we perform to apply a filter in image processing.\n",

"\n",

"Since we only work with discrete convolution, let me skip the mathematical definition of convolution and write instead its discrete 2D form:\n",

"\n",

"The discrete expression of a 2D convolution is given by\n",

"Since we only work with discrete convolutions, let me skip the mathematical definition of convolution and write instead its discrete 2D form: the discrete expression of a 2D convolution is given by\n",

"\n",

"$$g(x, y) = (\\omega * f)(x, y) = \\sum_{s=-a}^{a} \\sum_{t=-b}^{b} \\omega(s, t)\\, f(x - s, y - t)\\,,$$\n",

"\n",

"where $g$ is the filtered image, $f$ is the original image, $\\omega$ is the filter kernel. Notice that convolution is commutative: $f*\\omega$ = $\\omega*f$. However in practice one of the two function has only zero on a small area compared to the other image (at least in image filtering). We call the small function the kernel."

"where $g$ is the filtered image, $f$ is the original image, and $\\omega$ is the filter kernel. Notice that convolution is commutative: $f*\\omega = \\omega*f$. However, in practice one of the two functions is only non-zero on a small area compared to the other image (at least in image filtering). We call this function the kernel."

]

},
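To make the formula concrete, here is the convolution of a tiny image containing a single bright pixel with a $3\times3$ averaging kernel, using `scipy.ndimage.convolve`:

```python
import numpy as np
from scipy import ndimage

# A tiny image with a single bright pixel, and a 3x3 averaging kernel.
f = np.zeros((5, 5))
f[2, 2] = 9.0
w = np.ones((3, 3)) / 9.0

g = ndimage.convolve(f, w, mode='constant', cval=0.0)
# The bright pixel is spread over its 3x3 neighborhood: each pixel of
# that neighborhood now holds 9/9 = 1, and the total intensity is kept.
```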

{

...

...

@@ -816,7 +814,7 @@

"source": [

"#### Application of convolution to blur\n",

"\n",

"Below is an image of the Deep water horizon oil spill. There are many information in this picture and we may be interested in knowing the position of the oil spill or blurring the position of the boats. Let's see how we can handle that with convolution.\n"

"Below is an image of the Deepwater Horizon oil spill. There is a lot of information in this picture and we may be interested in knowing the position of the oil spill, finding how many boats are present in this figure, etc. Let's see how we can extract information from this image with convolutions.\n"

]

},

{

...

...

@@ -879,7 +877,7 @@

"id": "170bcba6",

"metadata": {},

"source": [

"> - Define a kernel matrix of size $(3\\times3)$ and apply a convolution to the image above with `ndimage.convolve`\n",

"> - Define a kernel matrix of size $(3\\times3)$ filled with ones and apply a convolution to the image above with `ndimage.convolve`\n",

"> - Plot the convolution of the original image with this kernel and comment.\n",

"> - What happens if your kernel is bigger?"

]
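A sketch of the exercise on a random stand-in image (the actual notebook uses the oil-spill photograph). Dividing the all-ones kernel by its size turns it into a local average, which blurs; a bigger kernel averages over a wider area and blurs more:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64))       # stand-in for the oil-spill image

# 3x3 kernel of ones; dividing by its size makes it a local average.
kernel = np.ones((3, 3))
blurred = ndimage.convolve(image, kernel / kernel.size)

# A bigger kernel averages over a wider neighborhood and blurs more.
kernel5 = np.ones((5, 5))
blurred5 = ndimage.convolve(image, kernel5 / kernel5.size)
```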

...

...

@@ -933,7 +931,7 @@

}

},

"source": [

"If you recall that the Laplace operator is a second order derivative, it will highlight small-scale structures. So we can use this type of convolution as an edge detection technique.\n",

"If you recall that the Laplace operator is a second order derivative, it will highlight small-scale structures in an image. We can use this type of convolution as an edge detection technique.\n",

"\n",

"> - Try it on the image above."

]
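A sketch with a synthetic image (a white square on a black background) rather than the notebook's photograph. The discrete Laplace kernel is a second-order derivative stencil, so it responds only where the intensity changes:

```python
import numpy as np
from scipy import ndimage

# Discrete Laplace kernel (second-order derivative stencil).
laplace = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=float)

# A flat image with a sharp square in the middle.
image = np.zeros((32, 32))
image[10:20, 10:20] = 1.0

edges = ndimage.convolve(image, laplace)
# The response is zero in flat regions (inside and outside the square)
# and non-zero only along the edges of the square.
```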

...

...

@@ -957,7 +955,7 @@

"id": "aac51872",

"metadata": {},

"source": [

"So the convolution operator can help us treat images in the exact same way as the filters in Photoshop. We can build kernels to detect edges, we can smooth part of an image, we can sharpen an edge, etc.\n"

"So the convolution operator can help us treat images in the exact same way as filters in Photoshop. We can build kernels to detect edges, we can smooth part of an image, we can sharpen an edge, etc.\n"

]

},

{

...

...

@@ -969,7 +967,7 @@

}

},

"source": [

"### Convolutional neural networks\n",

"### Convolutional neural networks (CNN)\n",

"\n",

"The idea is then to replace the weighted sum inside the activation function by a convolution. With this approach, we take advantage of the 2D structure of the image. Indeed, for fully connected networks, each neuron is completely independent of its neighbors. In a convolutional neural network (CNN) this is no longer the case because of the nature of the convolution."

]

...

...

@@ -983,9 +981,9 @@

}

},

"source": [

"In CNN the kernel plays a similar role as the weights. The key aspect of CNN is that when we do a convolution, we keep the same kernel (or weights) for the entire image. This greatly diminishes the number of parameters in our neural network.\n",

"In CNNs the kernel plays a similar role to the weights. The key aspect of CNNs is that when we do a convolution, we keep the same kernel (or weights) for the entire image. This greatly diminishes the number of parameters in our neural network.\n",

"\n",

"For instance if the kernel is of size $3\\times3$, we only have 9 parameters for that layer. This makes a huge difference between fully connected network and convolutional network. This difference is what opened the route to deep learning."

"For instance, if the kernel is of size $3\\times3$, we only have 9 parameters for that layer. This makes a huge difference between fully connected networks and convolutional networks. This difference is what opened the route to stacking many layers in a network, and this is what we call today *deep learning*."

]

},
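A back-of-the-envelope comparison (the image size and layer width are made up for illustration; biases are ignored):

```python
# Parameter count for one layer acting on a 28x28 image (biases ignored).
h, w = 28, 28

# Fully connected: every one of the h*w inputs connects to each of,
# say, 100 hidden neurons.
dense_params = h * w * 100          # 78400 weights

# Convolutional: one shared 3x3 kernel, wherever it is applied.
conv_params = 3 * 3                 # 9 weights

print(dense_params, conv_params)
```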

{

...

...

@@ -997,20 +995,11 @@

}

},

"source": [

"In practice, each convolution is usually followed by a shrinking or *Pooling* where we group several nearby cells and take either the maximum, or mean of these cells."

]

},

{

"cell_type": "markdown",

"id": "5031a9e4",

"metadata": {

"slideshow": {

"slide_type": "subslide"

}

},

"source": [

"At a given level, we may perform several convolutions, each time with a different kernel. The idea is to isolate different features of the input image by applying different filters. \n",

"#### Common practices\n",

"\n",

"In practice, each convolution is usually followed by a shrinking or *pooling* step where we group several nearby cells and take either the maximum or the mean of these cells.\n",

"\n",

"At a given level, we may perform several convolutions, each time with a different kernel. The idea is to isolate different features of the input image by applying different filters. \n",

"\n",

"Usually, the last layers of a CNN are traditional fully connected layers. Below are two illustrations of classical network architectures."

]
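A minimal NumPy sketch of max pooling (the reshaping trick assumes non-overlapping blocks, which is the common case):

```python
import numpy as np

def max_pool(image, size=2):
    """Max pooling: keep the maximum of each non-overlapping
    size x size block of the image."""
    h, w = image.shape
    trimmed = image[:h - h % size, :w - w % size]   # drop ragged edges
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool(x)   # a (2, 2) array of block maxima
```

Replacing `max` by `mean` in the last line of the function gives mean pooling.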

...

...

@@ -1024,7 +1013,7 @@

}

},

"source": [

"Some architectures are known to perform well on images such as [AlexNet](https://en.wikipedia.org/wiki/AlexNet) or [VGG](https://en.wikipedia.org/wiki/File:VGG_neural_network.png). These are still empirical architecture and there is no formal proof that they should outperform all other models.\n",

"Some architectures are known to perform well on images, such as [AlexNet](https://en.wikipedia.org/wiki/AlexNet) or [VGG](https://en.wikipedia.org/wiki/File:VGG_neural_network.png). These are still empirical architectures and there is no formal proof that they should outperform other models.\n",

"The network below is a line of $L$ neurons connected between the input and the output (there are $L-2$ hidden layers with 1 neuron per layer). This network cannot predict much but it will help us understand why it is hard to train a deep networks.\n",

"The network below is a line of $L$ neurons connected between the input and the output (there are $L-2$ hidden layers with $1$ neuron per layer). This network cannot predict much but it will help us understand why it is hard to train a deep network.\n",

"\n",

"> - Read the code below and train this network with $L=5$."

]

...

...

@@ -1158,7 +1147,7 @@

"source": [

"> - Derive an analytical expression for the gradient of the cost function with respect to all weights (you can use the recurrence relation that we derived in the [previous notebook](9_neural_networks.ipynb) or you can use the recurrence relation in the code above)\n",

"> - What is the maximum value of the derivative of the sigmoid?\n",

"> - If we initialize the weights with random numbers between 0 and 1, can you characterize the magnitude of the gradient as you go from the input layer to the output layer?\n",

"> - If we initialize the weights with random numbers between 0 and 1, can you characterize the magnitude of the gradient as you go from the input layer to the output layer with the help of the recurrence relation?\n",

"\n",

"This problem is called the *vanishing gradient problem*."

]
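A quick numerical illustration, under the assumption stated in the question (weights drawn uniformly in $(0,1)$): backpropagating through each layer multiplies the gradient by one factor $w_l\,\sigma'(a_l)$, and since $\sigma' \le 0.25$ these factors shrink the gradient geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum value is 0.25, reached at x = 0

rng = np.random.default_rng(0)
L = 20                     # depth of the chain of neurons (illustrative)
# One factor w_l * sigmoid'(a_l) per layer, with w_l uniform in (0, 1)
# and illustrative pre-activations a_l. Each factor is below 0.25.
factors = rng.random(L) * dsigmoid(rng.normal(size=L))
gradient_scale = np.prod(factors)  # collapses geometrically with L
```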

...

...

@@ -1172,7 +1161,7 @@

}

},

"source": [

"As you observed, the weights of a deep network will evolve much slower for a layer that is near the input layer. You either need to train your model for long enough or use more elaborate methods than simple gradient descent. Another possibility is to use the ReLU activation function instead of the sigmoid.\n",

"As you observed, during the training phase the weights of a deep network evolve much more slowly for a layer that is near the input layer. You either need to train your model for long enough or use more elaborate methods than simple gradient descent. Another possibility is to use the ReLU activation function instead of the sigmoid.\n",

"\n",

"> - Can you explain why the ReLU activation function does not suffer from the vanishing gradient problem?\n",