diff --git a/exercises/.gitignore b/exercises/.gitignore
index faa6d1dd7ab2cd676655c359c0eecb34914e8b4a..176ae11044fc05bc4479925d77303895ab49136a 100644
--- a/exercises/.gitignore
+++ b/exercises/.gitignore
@@ -1,3 +1,4 @@
 target/
 output.h5
+pics/
 perf.data*
diff --git a/handouts/src/22-shady-business.md b/handouts/src/22-shady-business.md
index 946ca2bf6a77868ec8baae311357e9e36a05be2f..0cf1f7e4316c072089eba6bd33f88cd849b0c73f 100644
--- a/handouts/src/22-shady-business.md
+++ b/handouts/src/22-shady-business.md
@@ -269,9 +269,9 @@ mod shader {
 }
 ```
 
-Finally, to make the code more maintainable, we give human-readable names to the
-shader binding points. These constants will still need to be kept up to date if
-the associated GLSL interface evolves over time, but at least we won't need to
+To make the code more maintainable, we give human-readable names to the shader
+binding points. These constants will still need to be kept up to date if the
+associated GLSL interface evolves over time, but at least we won't need to
 update magic numbers spread all over the code:
 
 ```rust
diff --git a/handouts/src/23-gazoduc.md b/handouts/src/23-gazoduc.md
index f8005bdf814cf1eb618e6ec3560019c04d8df7a8..9cbc27fa8a7c43e0c9713eb3348093299023e070 100644
--- a/handouts/src/23-gazoduc.md
+++ b/handouts/src/23-gazoduc.md
@@ -6,10 +6,10 @@ CPU-provided input and output resources, so before a shader can run, CPU-side
 steering code must first _bind_ input and output resources to the shader's data
 interface.
 
-In an attempt to simplify the life of larger GPU applications, GPU APIs like to
-make this resource binding process very flexible. But there is a tradeoff there.
-If the Vulkan spec just kept adding more configuration options at resource
-binding time, it would quickly cause two problems:
+In an attempt to simplify the life of larger applications, GPU APIs like to make
+this resource binding process very flexible. But there is a tradeoff there. If
+the Vulkan spec just kept adding more configuration options at resource binding
+time, it would quickly cause two problems:
 
 1. Resource binding calls, which are on the application's performance-critical
    code path where GPU shaders can be executed many times in rapid succession,
@@ -106,8 +106,9 @@ We have not introduced push constants before. They are a way to quickly pass
 small amounts of information (hardware-dependent, at least 128 bytes) to a
 pipeline  by directly storing it inline within the GPU command that starts the
 pipeline's execution. We will not need them in this Gray-Scott reaction
-simulation because a simulation step needs no input other than the input image,
-which itself is way too big to fit inside of a push constant.
+simulation because there are no simulation inputs that vary on each simulation
+step other than the input image, which itself is way too big to fit inside of a
+push constant.
 
 Therefore, the only thing that we actually need to pay attention to in this
 pipeline layout configuration is the `set_layouts` descriptor set configuration.
diff --git a/handouts/src/24-resourceful.md b/handouts/src/24-resourceful.md
index b48b9419554ebbbde3446c21017b628efbf1165d..5bc25d26485a539fdbbc3741618c03b330c5b7da 100644
--- a/handouts/src/24-resourceful.md
+++ b/handouts/src/24-resourceful.md
@@ -618,6 +618,8 @@ let execute_future = command_buffer.execute(context.queue.clone())?;
 are done executing via a synchronization primitive called a fence[^4].
 
 ```rust,ignore
+use vulkano::sync::GpuFuture;
+
 let fence_future = execute_future.then_signal_fence_and_flush()?;
 ```
 
@@ -740,14 +742,15 @@ keep the objects that we use directly:
 
 - Descriptor sets are used to specify our compute pipeline's concentration
   inputs and outputs.
-- One of the concentration buffers and the matching image will be needed again
-  when downloading the concentration of V from the GPU for the purpose of saving
-  output data to HDF5 files.
+- The V species images are used, together with a matching buffer, to download
+  the concentration of V from the GPU for the purpose of saving output data to
+  HDF5 files.
 
 We will bundle this state together using a double buffering abstraction similar
-to the one that we have used previously in the CPU code, except retaining
-API compatibility with the CPU code will require a fair bit more black magic
-under the hood.
+to the one that we have used previously in the CPU code, trying to retain as
+much API compatibility as we can to ease the migration. But as you will see,
+this comes at a significant cost, so we may want to revisit this decision later
+on, after we get the code into a runnable and testable state.
 
 ```rust,ignore
 use crate::context::CommandBufferAllocator;
@@ -765,10 +768,12 @@ pub struct Concentrations {
 
     // Area where we download the concentration of the V species
     v_buffer: Subbuffer<[Float]>,
-    // NOTE: It is possible to do without this copy, but harder than you think
+    // Unsatisfying copy of the contents of `v_buffer`, present for API
+    // compatibility reasons.
     v_ndarray: Array2<Float>,
 
-    // Subset of the GPU context that is retained for CPU API compatibility
+    // Subset of the GPU context that we need to keep around
+    // in order to provide CPU API compatibility
     command_alloc: Arc<CommandBufferAllocator>,
     queue: Arc<Queue>,
 }
@@ -833,7 +838,7 @@ impl Concentrations {
     /// Run a simulation step
     ///
     /// The user callback function `step` will be called with the proper
-    /// descriptor set for executing the GPU compute pipeline.
+    /// descriptor set for executing the GPU compute pipeline.
     pub fn update(
         &mut self,
         step: impl FnOnce(Arc<PersistentDescriptorSet>) -> Result<(), Box<dyn Error>>,
@@ -845,9 +850,32 @@ impl Concentrations {
 }
 ```
 
-
-TODO: Finish with common struct
-
+In this first GPU version, some sacrifices were made to keep the API close to
+that of the CPU code:
+
+- The `Concentrations` struct needs to store a fair bit more state than its
+  CPU counterpart, including state that is only remotely related to its
+  purpose, like some elements of the Vulkan context.
+- `current_v()` needs to build, submit, flush and await a command buffer for a
+  single download command, which is not ideal in terms of CPU/GPU communication
+  efficiency.
+- `current_v()` needs to copy the freshly downloaded GPU data into a separate
+  `Array2` in order to keep the same API return type.
+- Although `update()` was adapted to propagate GPU errors (which is necessary
+  because almost every Vulkan function can error out), its API design was not
+  otherwise modified to accommodate the asynchronous command-buffer-based
+  workflow of Vulkan. As a result, we will need to "work around the API" through
+  quite inelegant code later on.
+
+Given these compromises, the minimal set of API changes that we need is as follows:
+
+- `new()` has completely different parameters because it needs access to all the
+  other GPU state, and we definitely don't want to make `Concentrations`
+  responsible for managing that state too.
+- Both `current_v()` and `update()` must be adapted to handle the fact that GPU
+  APIs can error out a lot more often than CPU APIs.[^6]
+- The `update()` callback needs a different signature because GPU code
+  manipulates inputs and outputs very differently from CPU code.
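+
+To make this a bit more concrete, here is a rough sketch of how calling code
+may end up using this revised API, with GPU errors being propagated everywhere.
+The `one_output_step()` wrapper and the `gpu_step` callback are illustrative
+assumptions, not actual items from the codebase:
+
+```rust,ignore
+use std::{error::Error, sync::Arc};
+use vulkano::descriptor_set::PersistentDescriptorSet;
+
+// Hedged illustration of the caller-side pattern, not actual codebase code
+fn one_output_step(
+    concentrations: &mut Concentrations,
+    gpu_step: impl FnOnce(Arc<PersistentDescriptorSet>) -> Result<(), Box<dyn Error>>,
+) -> Result<(), Box<dyn Error>> {
+    // Every GPU-backed entry point can fail, so errors are propagated with `?`
+    concentrations.update(gpu_step)?;
+    // Downloading the V concentration goes through the GPU and can fail too
+    let v = concentrations.current_v()?;
+    // ...save `v` to the HDF5 output file as in the CPU version...
+    Ok(())
+}
+```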
 
 
 ## Exercise
@@ -902,3 +930,7 @@ If you are going fast and want a more challenging exercise...
       interacting with the GPU, the atomic increments of a few `Arc` clones here
       and there are not normally a significant cost. So I think that for a GPU
       API binding, `vulkano`'s `Arc`-everywhere design is a reasonable choice.
+
+[^6]: We could keep the CPU and GPU APIs in sync here by making the
+      corresponding CPU entry points return `Result<(), Box<dyn Error>>` too.
+      It would just happen that the error case is never hit.
diff --git a/handouts/src/25-simulating.md b/handouts/src/25-simulating.md
new file mode 100644
index 0000000000000000000000000000000000000000..19958b83e167243f61eeabf26a598bf2276f8f7c
--- /dev/null
+++ b/handouts/src/25-simulating.md
@@ -0,0 +1,325 @@
+# Execution
+
+At this point, we have a Vulkan context, we have a compute pipeline, and we have
+descriptor sets that we can use to configure the pipeline's inputs and outputs.
+Now all we need to do is put them together, and we will finally have a complete
+GPU simulation that we can test and optimize. Getting to this major milestone
+will be the goal of this chapter.
+
+
+## Work-groups
+
+Back when we introduced this simulation's GLSL shader, we mentioned in passing
+that work-groups are the Vulkan (more generally, Khronos) equivalent of NVidia
+CUDA's thread blocks. As part of executing our compute pipeline, we will meet
+them again, so let's discuss them a little more.
+
+### How work-groups came to be
+
+A couple of decades ago, when GPU APIs started to expose programmable shaders to
+developers, they made one very important API design choice: to abstract away the
+many forms of hardware concurrency and parallelism (SIMD, VLIW/superscalar,
+hardware threading, multicore...) behind a single unifying "data-parallel for
+loop" interface.
+
+In this grand unified design, the developer would write their program in terms
+of how a single visual component[^1] of the 3D scene is processed, then specify
+on the CPU side which set of visual components the scene is made of, and the
+GPU driver and hardware would then be in charge of ensuring that all specified
+visual components are eventually processed by the developer-specified program,
+possibly in parallel across multiple execution units.
+
+To give the implementation maximal freedom, GPU APIs only exposed a minimal
+ability for these shader executions to communicate with each other. Basically,
+the only API-sanctioned way was to run multiple GPU jobs in a sequence, using
+outputs from job N to adjust the configuration of job N+1 from the CPU side.
+Anything else was a non-portable hack that required very good knowledge of the
+underlying hardware and how the GPU driver would map the GPU API to this
+hardware.
+
+This model has served the graphics community very well, enabling GPU programs to
+achieve good compatibility across many different kinds of hardware, and to scale
+to very high degrees of hardware parallelism due to the small amount of
+communication that the API enforces. But as the numerical computing community
+started abusing GPUs for scientific computations, the lack of a fast
+communication primitive quickly emerged as a problem that warranted a better
+solution.
+
+### Work-groups today
+
+Thus came work-groups, which introduced some hierarchy to the execution domain:
+
+- The 1/2/3D range of indices over which the compute work is distributed is now
+  sliced into contiguous blocks of uniform size, with blocks of size [1, 1, 1]
+  roughly matching the API semantics of the former configuration.[^2]
+- Each work-item within a work-group is guaranteed to be executed concurrently
+  on the same GPU compute unit, enabling fast communication through hardware 
+  channels like the compute unit's L1 cache (aka shared/local memory) and
+  group-local execution and memory barriers. In contrast, work-items from
+  different work-groups may still only communicate with each other through very
+  clever hacks of dubious portability, as before.
+
+Since that time, further performance concerns have led to the exposure of even 
+more hardware concepts in the GPU APIs[^3], further complexifying the execution
+model for those programs that want to use optimally efficient communication
+patterns. But work-groups remain to this day the only part of the GPU execution
+model that one **needs** to care about in order to be able to use a GPU for
+computation at all. Even if your computation is purely data-parallel, like our
+first Gray-Scott reaction simulation, you will still need to set a work-group
+size in order to be able to run it.
+
+### Picking a work-group size
+
+So what work-group size should you pick when your code does not communicate,
+and you therefore have no particular constraint? Well, the answer depends on
+several factors:
+
+- Work-groups should be at least as large as the GPU's SIMD vector width,
+  otherwise execution performance will drop by a corresponding factor. More
+  generally, their size should be a multiple of the GPU's SIMD vector width.
+- Beyond that, if you do not need to communicate or otherwise leverage shared
+  resources, smaller work-groups are often better. They allow the GPU to
+  distribute the work across more compute units, improve load balancing, and
+  reduce pressure on shared compute unit resources like registers and the L1
+  cache.
+- If you do use the shared resources, then you can try to increase the
+  work-group size until either the trade-off becomes unfavorable and speed goes
+  down or you reach the hardware limit. But that is generally a
+  hardware-specific empirical tuning process.
+
+Knowing that [all GPU hardware in common use has a SIMD vector width that is a
+power of two no larger than
+64](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.3&name=minSubgroupSize&platform=all),
+a work-group of 64 work-items sounds like a reasonable start. We may then
+fine-tune later on by increasing work-group size in increments of 64, to see how
+it affects performance.
+
+Moving on, in 2D and 3D problems, there is also a question of work-group shape.
+Should work-groups be square/cubic? Rectangular? Horizontal or vertical lines?
+Here again, there is a trade-off:
+
+- Hardware loves long streams of contiguous memory accesses. So whenever
+  possible, work-group shape should follow the direction of contiguous data in
+  memory and be as elongated as possible in this direction.
+- But in stencil problems like ours, spatial locality also matters: it's good if
+  the data that is loaded by one work-item can be reused by another work-item
+  within the same work-group, as the two work-items will then be able to
+  leverage the fact that they share an L1 cache for improved memory access
+  performance.
+
+Here we are using images, whose memory layout is unspecified but optimized for
+2D locality. And we obviously care about spatial locality for the purpose of
+loading our input data. Therefore, a square work-group sounds like a good
+starting point. And all in all, that is why we have initially set the GLSL
+work-group size to 8x8.
+
+
+## A single update step
+
+For the purpose of making the CPU-to-GPU transition faster and easier, we will
+again mimic the design of the CPU API in the GPU API, even when it means doing
+some sub-optimal things in the first version. We will therefore again have a
+function that performs one GPU compute step, except this time said function will
+have this signature:
+
+```rust,ignore
+use crate::{
+    gpu::context::VulkanContext,
+    options::RunnerOptions,
+};
+use std::{error::Error, sync::Arc};
+use vulkano::{
+    descriptor_set::PersistentDescriptorSet,
+    pipeline::ComputePipeline,
+};
+
+/// Run a simulation step on the GPU
+pub fn update(
+    context: &VulkanContext,
+    options: &RunnerOptions,
+    pipeline: Arc<ComputePipeline>,
+    parameters: Arc<PersistentDescriptorSet>,
+    concentrations: Arc<PersistentDescriptorSet>,
+) -> Result<(), Box<dyn Error>> {
+    // TODO: Actually do the work
+}
+```
+
+Notice that in addition to the three parameters that you most likely expected
+(the compute pipeline and the two resource descriptor sets that will be bound to
+it), we will also need a Vulkan context to submit commands, and the simulation
+runner options to tell us how large the simulation domain is. And because this is
+GPU code, we also need to account for the possibility of API errors.
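+
+For reference, here is roughly how this function is meant to be called from the
+simulation loop, in combination with the `Concentrations::update()` callback
+from the previous chapter. This is only a hedged sketch, and the surrounding
+variable names are assumptions:
+
+```rust,ignore
+// Hypothetical call site, once everything is wired up
+concentrations.update(|images| {
+    update(&context, &options, pipeline.clone(), parameters.clone(), images)
+})?;
+```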
+
+Inside of this function, we will first translate our simulation domain shape
+into a GPU dispatch size. This means that we will need to check that the
+simulation domain is composed of a whole number of work-groups (a restriction
+of this version of the simulation which would be hard to work around[^4]), and
+then translate the simulation domain shape into a number of work-groups that the
+GPU should run across each spatial dimension.
+
+```rust,ignore
+use crate::gpu::pipeline::WORK_GROUP_SHAPE;
+
+assert!(
+    options.num_rows.max(options.num_cols) < u32::MAX as usize,
+    "Simulation domain has too many rows/columns for a GPU"
+);
+let num_rows = options.num_rows as u32;
+let num_cols = options.num_cols as u32;
+let work_group_cols = WORK_GROUP_SHAPE[0];
+let work_group_rows = WORK_GROUP_SHAPE[1];
+assert!(
+    (num_cols % work_group_cols).max(num_rows % work_group_rows) == 0,
+    "Simulation domain size must be a multiple of GPU work-group size"
+);
+let dispatch_size = [num_cols / work_group_cols, num_rows / work_group_rows, 1];
+```
+
+Notice how, as previously mentioned, we must be careful when mixing
+simulation domain shapes (where dimensions are normally specified in `[rows,
+columns]` order) and GPU 2D spaces (where dimensions are normally specified in
+`[width, height]` order).
+
+After this, we'll start to record a command buffer, and also grab a copy of our
+compute pipeline's I/O layout for reasons that will become clear in the next
+step.
+
+```rust,ignore
+use vulkano::{
+    command_buffer::{AutoCommandBufferBuilder, CommandBufferUsage},
+    pipeline::Pipeline,
+};
+
+let mut builder = AutoCommandBufferBuilder::primary(
+    &context.command_alloc,
+    context.queue.queue_family_index(),
+    CommandBufferUsage::OneTimeSubmit,
+)?;
+let pipeline_layout = pipeline.layout().clone();
+```
+
+We will then bind our compute pipeline (which moves it away, hence the previous
+layout copy) and its parameters (which requires a copy of the pipeline layout).
+
+```rust,ignore
+use crate::gpu::pipeline::PARAMS_SET;
+use vulkano::pipeline::PipelineBindPoint;
+
+builder
+    .bind_pipeline_compute(pipeline)?
+    .bind_descriptor_sets(
+        PipelineBindPoint::Compute,
+        pipeline_layout.clone(),
+        PARAMS_SET,
+        parameters,
+    )?;
+```
+
+And then we will bind the current simulation inputs and outputs and schedule
+an execution of the compute pipeline with the previously computed number of
+work-groups...
+
+```rust,ignore
+use crate::gpu::pipeline::IMAGES_SET;
+
+builder
+    .bind_descriptor_sets(
+        PipelineBindPoint::Compute,
+        pipeline_layout,
+        IMAGES_SET,
+        concentrations,
+    )?
+    .dispatch(dispatch_size)?;
+```
+
+Finally, we will build the command buffer, submit it to the command queue,
+flush the command queue to the GPU, and wait for execution to finish.
+
+```rust,ignore
+use vulkano::{
+    command_buffer::PrimaryCommandBufferAbstract,
+    sync::GpuFuture,
+};
+
+builder
+    .build()?
+    .execute(context.queue.clone())?
+    .then_signal_fence_and_flush()?
+    .wait(None)?;
+```
+
+With this, we get an `update()` function whose semantics roughly match those of
+the CPU version, and will thus be the easiest to integrate into the existing code.
+But as you may have guessed while reading the previous code, some sacrifices had
+to be made to achieve this:
+
+- We are not leveraging Vulkan's command batching capabilities. On every
+  simulation step, we need to build a new command buffer that binds the
+  simulation compute pipeline and parameters set, only to execute a single
+  compute dispatch and wait for it. These overheads are especially bad because
+  they are paid once per simulation step, so scaling up the number of compute
+  steps per saved image will not amortize them.
+- Compared to an optimized alternative that would just bind the compute
+  descriptor set and schedule a compute pipeline dispatch, leaving the rest up
+  to higher-level code, our `update()` function needs to handle many unrelated
+  concerns, and gets a more complex signature and complicated calling procedure
+  as a result.
+
+Therefore, as soon as we get the simulation to run, this should be our first
+target for optimization.
+
+
+## Exercise
+
+At long last, we are ready to roll out a full Vulkan-based GPU simulation.
+
+Integrate the simulation update function discussed above at the root of the
+`gpu` module of the codebase, then write a new version of `run_simulation()` in
+the same module that uses all the infrastructure introduced so far to run the
+simulation on the GPU.
+
+Next, make the simulation use the GPU version of `run_simulation()`. To enable
+CPU/GPU comparisons, it is a good idea to keep it easy to run the CPU
+version. For example, you could have an off-by-default `cpu: bool` runner option
+which lets you switch back to CPU mode.
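+
+If the runner options are defined through `clap`'s derive API, this could boil
+down to one extra struct field along the following lines (a sketch under that
+assumption, to be adapted to the actual options struct):
+
+```rust,ignore
+use clap::Parser;
+
+#[derive(Parser)]
+pub struct RunnerOptions {
+    /// Run the simulation on the CPU instead of the GPU
+    #[arg(long)]
+    pub cpu: bool,
+
+    // ...existing options (domain size, number of steps, ...)
+}
+```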
+
+Once the simulation builds, make sure that it runs successfully and without
+significant[^5] Vulkan validation warnings. You can check the latter by running
+the simulation binary in debug mode (`cargo run --bin simulate` without the
+usual `--release` option). Beware that such a debug binary can be slow: you will
+want to run it with a smaller number of simulation steps.
+
+And finally, once the simulation binary seems to run successfully, adjust the
+microbenchmark to evaluate its performance:
+
+- Add a new GPU update microbenchmark in addition to the existing CPU
+  microbenchmark, so that you can measure how the two implementations compare.
+- Add another GPU microbenchmark that measures the performance of getting the
+  data from the GPU to the CPU (via `Concentrations::current_v()`).
+
+Then run the resulting microbenchmark, and compare the performance
+characteristics of our first GPU version to the optimized CPU version.
+
+
+---
+
+[^1]: I'm purposely being vague about what a "visual component" is, since even
+      back in the day there were already two kinds of components with
+      configurable processing (triangle vertex and pixel/fragment), and the list
+      has grown enormously since: geometry primitives, tessellator inputs and
+      outputs, rays, meshlets...
+
+[^2]: ...albeit with bad performance because SIMD can't help leaking through
+      every nice abstraction and making the life of everyone more complicated.
+
+[^3]: First came sub-groups/warps, which expose hardware SIMD instructions. Then
+      came NVidia's thread-block clusters, which to my knowledge do not have a
+      standardized Khronos name yet, but are basically about modeling L2 cache
+      shards shared by multiple GPU compute units.
+
+[^4]: It could be done with a simple `if` in the current version, but would
+      become much harder when introducing later optimizations that leverage
+      the communication capabilities of work-groups and subgroups.
+
+[^5]: As a minor spoiler, you will find that even `vulkano` gets some subtleties
+      of Vulkan wrong.
diff --git a/handouts/src/SUMMARY.md b/handouts/src/SUMMARY.md
index 28ec4f8db37d4f41ddf8aa75ea822dfe6fa20640..07a928f860678dbfddbb5c88703f2ab372c9a513 100644
--- a/handouts/src/SUMMARY.md
+++ b/handouts/src/SUMMARY.md
@@ -34,9 +34,8 @@
 - [Shader](22-shady-business.md)
 - [Pipeline](23-gazoduc.md)
 - [Resources](24-resourceful.md)
-- [First run]()
-- [Writing output]()
-- [Optimizing]()
+- [Execution](25-simulating.md)
+- [Optimizations]()
 
 ---