diff --git a/exercises/.gitignore b/exercises/.gitignore index faa6d1dd7ab2cd676655c359c0eecb34914e8b4a..176ae11044fc05bc4479925d77303895ab49136a 100644 --- a/exercises/.gitignore +++ b/exercises/.gitignore @@ -1,3 +1,4 @@ target/ output.h5 +pics/ perf.data* diff --git a/handouts/src/22-shady-business.md b/handouts/src/22-shady-business.md index 946ca2bf6a77868ec8baae311357e9e36a05be2f..0cf1f7e4316c072089eba6bd33f88cd849b0c73f 100644 --- a/handouts/src/22-shady-business.md +++ b/handouts/src/22-shady-business.md @@ -269,9 +269,9 @@ mod shader { } ``` -Finally, to make the code more maintainable, we give human-readable names to the -shader binding points. These constants will still need to be kept up to date if -the associated GLSL interface evolves over time, but at least we won't need to +To make the code more maintainable, we give human-readable names to the shader +binding points. These constants will still need to be kept up to date if the +associated GLSL interface evolves over time, but at least we won't need to update magic numbers spread all over the code: ```rust diff --git a/handouts/src/23-gazoduc.md b/handouts/src/23-gazoduc.md index f8005bdf814cf1eb618e6ec3560019c04d8df7a8..9cbc27fa8a7c43e0c9713eb3348093299023e070 100644 --- a/handouts/src/23-gazoduc.md +++ b/handouts/src/23-gazoduc.md @@ -6,10 +6,10 @@ CPU-provided input and output resources, so before a shader can run, CPU-side steering code must first _bind_ input and output resources to the shader's data interface. -In an attempt to simplify the life of larger GPU applications, GPU APIs like to -make this resource binding process very flexible. But there is a tradeoff there. -If the Vulkan spec just kept adding more configuration options at resource -binding time, it would quickly cause two problems: +In an attempt to simplify the life of larger applications, GPU APIs like to make +this resource binding process very flexible. But there is a tradeoff there. If +the Vulkan spec just kept adding more configuration options at resource binding +time, it would quickly cause two problems: 1. Resource binding calls, which are on the application's performance-critical code path where GPU shaders can be executed many times in rapid succession, @@ -106,8 +106,9 @@ We have not introduced push constants before. They are a way to quickly pass small amounts of information (hardware-dependent, at least 128 bytes) to a pipeline by directly storing it inline within the GPU command that starts the pipeline's execution. We will not need them in this Gray-Scott reaction -simulation because a simulation step needs no input other than the input image, -which itself is way too big to fit inside of a push constant. +simulation because there are no simulation inputs that vary on each simulation +step other than the input image, which itself is way too big to fit inside of a +push constant. Therefore, the only thing that we actually need to pay attention to in this pipeline layout configuration is the `set_layouts` descriptor set configuration. diff --git a/handouts/src/24-resourceful.md b/handouts/src/24-resourceful.md index b48b9419554ebbbde3446c21017b628efbf1165d..5bc25d26485a539fdbbc3741618c03b330c5b7da 100644 --- a/handouts/src/24-resourceful.md +++ b/handouts/src/24-resourceful.md @@ -618,6 +618,8 @@ let execute_future = command_buffer.execute(context.queue.clone())?; are done executing via a synchronization primitive called a fence[^4]. 
 ```rust,ignore
+use vulkano::sync::GpuFuture;
+
 let fence_future = execute_future.then_signal_fence_and_flush()?;
 ```
@@ -740,14 +742,15 @@ keep the objects that we use directly:
 
 - Descriptor sets are used to specify our compute pipeline's concentration
   inputs and outputs.
-- One of the concentration buffers and the matching image will be needed again
-  when downloading the concentration of V from the GPU for the purpose of saving
-  output data to HDF5 files.
+- The V species images are used, together with a matching buffer, to download
+  the concentration of V from the GPU for the purpose of saving output data to
+  HDF5 files.
 
 We will bundle this state together using a double buffering abstraction similar
-to the one that we have used previously in the CPU code, except retaining
-API compatibility with the CPU code will require a fair bit more black magic
-under the hood.
+to the one that we have used previously in the CPU code, trying to retain as
+much API compatibility as we can to ease the migration. But as you will see,
+this comes at a significant cost, so we may want to revisit this decision later
+on, after we get the code in a runnable and testable state.
 
 ```rust,ignore
 use crate::context::CommandBufferAllocator;
@@ -765,10 +768,12 @@ pub struct Concentrations {
     // Area where we download the concentration of the V species
     v_buffer: Subbuffer<[Float]>,
 
-    // NOTE: It is possible to do without this copy, but harder than you think
+    // Unsatisfying copy of the contents of `v_buffer`, present for API
+    // compatibility reasons.
    v_ndarray: Array2<Float>,
 
-    // Subset of the GPU context that is retained for CPU API compatibility
+    // Subset of the GPU context that we need to keep around
+    // in order to provide CPU API compatibility
    command_alloc: Arc<CommandBufferAllocator>,
    queue: Arc<Queue>,
 }
@@ -833,7 +838,7 @@ impl Concentrations {
     /// Run a simulation step
     ///
     /// The user callback function `step` will be called with the proper
-    /// descriptor set for executing the GPU compute pipeline.
+    /// descriptor set for executing the GPU compute pipeline.
    pub fn update(
        &mut self,
        step: impl FnOnce(Arc<PersistentDescriptorSet>) -> Result<(), Box<dyn Error>>,
@@ -845,9 +850,32 @@ impl Concentrations {
 }
 ```
-
-TODO: Finish with common struct
-
+Some sacrifices were made to keep the API similar to the CPU version in this
+first GPU version:
+
+- The `Concentrations` struct needs to store a fair bit more state than its
+  CPU version, including state that is only remotely related to its purpose
+  like some elements of the Vulkan context.
+- `current_v()` needs to build, submit, flush and await a command buffer for a
+  single download command, which is not ideal in terms of CPU/GPU communication
+  efficiency.
+- `current_v()` needs to copy the freshly downloaded GPU data into a separate
+  `Array2` in order to keep the same API return type.
+- Although `update()` was adapted to propagate GPU errors (which is necessary
+  because almost every Vulkan function can error out), its API design was not
+  otherwise modified to accommodate the asynchronous, command-buffer-based
+  workflow of Vulkan. As a result, we will need to "work around the API"
+  through quite inelegant code later on.
+
+Given these compromises, the minimal set of API changes needed is that...
+
+- `new()` has completely different parameters because it needs access to all the
+  other GPU state, and we definitely don't want to make `Concentrations`
+  responsible for managing that state too.
+- Both `current_v()` and `update()` must be adapted to handle the fact that GPU
+  APIs can error out a lot more often than CPU APIs.[^6]
+- The `update()` callback needs a different signature because GPU code
+  manipulates inputs and outputs very differently from CPU code.
 
 ## Exercise
 
@@ -902,3 +930,7 @@ If you are going fast and want a more challenging exercise...
      interacting with the GPU, the atomic increments of a few `Arc` clones here
      and there are not normally a significant cost. So I think that for a GPU
      API binding, `vulkano`'s `Arc`-everywhere design is a reasonable choice.
+
+[^6]: We could adjust the CPU API to keep them in sync here by making its
+      corresponding entry points return `Result<(), Box<dyn Error>>` too. It
+      would just happen that the error case is never hit.
diff --git a/handouts/src/25-simulating.md b/handouts/src/25-simulating.md
new file mode 100644
index 0000000000000000000000000000000000000000..19958b83e167243f61eeabf26a598bf2276f8f7c
--- /dev/null
+++ b/handouts/src/25-simulating.md
@@ -0,0 +1,325 @@
+# Execution
+
+At this point, we have a Vulkan context, we have a compute pipeline, and we have
+descriptor sets that we can use to configure the pipeline's inputs and outputs.
+Now all we need to do is put them together, and we will finally have a complete
+GPU simulation that we can test and optimize. Getting to this major milestone
+will be the goal of this chapter.
+
+
+## Work-groups
+
+Back when we introduced this simulation's GLSL shader, we mentioned in passing
+that work-groups are the Vulkan (more generally, Khronos) equivalent of NVidia
+CUDA's thread blocks. As part of executing our compute pipeline, we will meet
+them again, so let's discuss them a little more.
+
+### How work-groups came to be
+
+More than two decades ago, when GPU APIs started to expose programmable shaders
+to developers, they made one very important API design choice: to abstract away
+the many forms of hardware concurrency and parallelism (SIMD, VLIW/superscalar,
+hardware threading, multicore...) behind a single unifying "data-parallel for
+loop" interface.
+
+In this grand unified design, the developer would write their program in terms
+of how a single visual component[^1] of the 3D scene is processed, then specify
+on the CPU side the set of visual components that the scene is made of, and the
+GPU driver and hardware would then be in charge of ensuring that all specified
+visual components are eventually processed by the developer-specified program,
+possibly in parallel across multiple execution units.
+
+To give the implementation maximal freedom, GPU APIs only exposed a minimal
+ability for these shader executions to communicate with each other. Basically,
+the only API-sanctioned way was to run multiple GPU jobs in a sequence, using
+outputs from job N to adjust the configuration of job N+1 from the CPU side.
+Anything else was a non-portable hack that required very good knowledge of the
+underlying hardware and how the GPU driver would map the GPU API to this
+hardware.
+
+This model has served the graphics community very well, enabling GPU programs to
+achieve good compatibility across many different kinds of hardware, and to scale
+to very high degrees of hardware parallelism due to the small amount of
+communication that the API enforces. But as the numerical computing community
+started abusing GPUs for scientific computations, the lack of a fast
+communication primitive quickly emerged as a problem that warranted a better
+solution.
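+
+To make the cost of that historical pattern concrete, here is what it looked
+like in pseudo-Rust. This is a sketch only: every name below is illustrative,
+none of it comes from a real GPU API.
+
+```rust,ignore
+// CPU-mediated GPU-to-GPU communication: each job must run to completion
+// before the CPU can observe its output and use it to configure the next
+// job, paying a full CPU/GPU round trip per job.
+let mut config = initial_config();
+for _ in 0..num_jobs {
+    let output = submit_and_wait(gpu_job(&config))?;
+    config = adjust(&output);
+}
+```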
+
+### Work-groups today
+
+Thus came work-groups, which introduced some hierarchy to the execution domain:
+
+- The 1/2/3D range of indices over which the compute work is distributed is now
+  sliced into contiguous blocks of uniform size, with blocks of size [1, 1, 1]
+  roughly matching the API semantics of the former configuration.[^2]
+- Each work-item within a work-group is guaranteed to be executed concurrently
+  on the same GPU compute unit, enabling fast communication through hardware
+  channels like the compute unit's L1 cache (aka shared/local memory) and
+  group-local execution and memory barriers. In contrast, work-items from
+  different work-groups may still only communicate with each other through very
+  clever hacks of dubious portability, as before.
+
+Since that time, further performance concerns have led to the exposure of even
+more hardware concepts in the GPU APIs[^3], further complexifying the execution
+model for those programs that want to use optimally efficient communication
+patterns. But work-groups remain to this day the only part of the GPU execution
+model that one **needs** to care about in order to be able to use a GPU for
+computation at all. Even if your computation is purely data-parallel, like our
+first Gray-Scott reaction simulation, you will still need to set a work-group
+size in order to be able to run it.
+
+### Picking a work-group size
+
+So what work-group size should you pick when your code is not communicating and
+you therefore don't care? Well, the answer depends on several factors:
+
+- Work-groups should be at least as large as the GPU's SIMD vector width,
+  otherwise execution performance will drop by a corresponding factor. More
+  generally, their size should be a multiple of the GPU's SIMD vector width.
+- Beyond that, if you do not need to communicate or otherwise leverage shared
+  resources, smaller work-groups are often better. They allow the GPU to
+  distribute the work across more compute units, improve load balancing, and
+  reduce pressure on shared compute unit resources like registers and the L1
+  cache.
+- If you do use the shared resources, then you can try to increase the
+  work-group size until either the trade-off becomes unfavorable and speed goes
+  down, or you reach the hardware limit. But that is generally a
+  hardware-specific empirical tuning process.
+
+Knowing that [all GPU hardware in common use has a SIMD vector width that is a
+power of two no larger than
+64](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.3&name=minSubgroupSize&platform=all),
+a work-group of 64 work-items sounds like a reasonable start. We may then
+fine-tune later on by increasing work-group size in increments of 64, to see how
+it affects performance.
+
+Moving on, in 2D and 3D problems, there is also a question of work-group shape.
+Should work-groups be square/cubic? Rectangular? Horizontal or vertical lines?
+Here again, there is a trade-off:
+
+- Hardware loves long streams of contiguous memory accesses. So whenever
+  possible, work-group shape should follow the direction of contiguous data in
+  memory and be as elongated as possible in this direction.
+- But in stencil problems like ours, spatial locality also matters: it's good if
+  the data that is loaded by one work-item can be reused by another work-item
+  within the same work-group, as the two work-items will then be able to
+  leverage the fact that they share an L1 cache for improved memory access
+  performance (see the sketch after this list).
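+
+To make this trade-off more concrete, here is a back-of-the-envelope comparison
+of the two extreme shapes of a 64-item work-group, assuming the 3x3 stencil of
+our simulation. The constant names below are illustrative, not part of the
+codebase:
+
+```rust,ignore
+// Work-group shapes are specified in [width, height] order.
+
+// Fully elongated along the contiguous dimension: ideal for streaming, but
+// with a 3x3 stencil this shape must load a (64+2)x(1+2) = 198-texel input
+// window in order to produce its 64 outputs.
+const LINE_SHAPE: [u32; 2] = [64, 1];
+
+// Square: producing the same 64 outputs only requires loading a
+// (8+2)x(8+2) = 100-texel input window, i.e. almost 2x less input data.
+const SQUARE_SHAPE: [u32; 2] = [8, 8];
+```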
+
+Here we are using images, whose memory layout is unspecified but optimized for
+2D locality. And we obviously care about spatial locality for the purpose of
+loading our input data. Therefore, a square work-group sounds like a good
+starting point. And all in all, that is why we have initially set the GLSL
+work-group size to 8x8.
+
+
+## A single update step
+
+For the purpose of making the CPU-to-GPU transition faster and easier, we will
+again mimic the design of the CPU API in the GPU API, even when it means doing
+some sub-optimal things in the first version. We will therefore again have a
+function that performs one GPU compute step, except this time said function
+will have this signature:
+
+```rust,ignore
+use crate::{
+    gpu::context::VulkanContext,
+    options::RunnerOptions,
+};
+use std::{error::Error, sync::Arc};
+use vulkano::{
+    descriptor_set::PersistentDescriptorSet,
+    pipeline::ComputePipeline,
+};
+
+/// Run a simulation step on the GPU
+pub fn update(
+    context: &VulkanContext,
+    options: &RunnerOptions,
+    pipeline: Arc<ComputePipeline>,
+    parameters: Arc<PersistentDescriptorSet>,
+    concentrations: Arc<PersistentDescriptorSet>,
+) -> Result<(), Box<dyn Error>> {
+    // TODO: Actually do the work
+}
+```
+
+Notice that in addition to the three parameters that you most likely expected
+(the compute pipeline and the two resource descriptor sets that will be bound to
+it), we will also need a Vulkan context to submit commands, and the simulation
+runner options to tell us how large the simulation domain is. And because this
+is GPU code, we also need to account for the possibility of API errors.
+
+Inside of this function, we will first translate our simulation domain shape
+into a GPU dispatch size. This means that we will need to check that the
+simulation domain is composed of an integer number of work-groups (a
+restriction of this version of the simulation which would be hard to work
+around[^4]), and then translate the simulation domain shape into a number of
+work-groups that the GPU should run across each spatial dimension.
+
+```rust,ignore
+use crate::gpu::pipeline::WORK_GROUP_SHAPE;
+
+assert!(
+    options.num_rows.max(options.num_cols) < u32::MAX as usize,
+    "Simulation domain has too many rows/columns for a GPU"
+);
+let num_rows = options.num_rows as u32;
+let num_cols = options.num_cols as u32;
+let work_group_cols = WORK_GROUP_SHAPE[0];
+let work_group_rows = WORK_GROUP_SHAPE[1];
+assert!(
+    (num_cols % work_group_cols).max(num_rows % work_group_rows) == 0,
+    "Simulation domain size must be a multiple of GPU work-group size"
+);
+let dispatch_size = [num_cols / work_group_cols, num_rows / work_group_rows, 1];
+```
+
+Notice how, as previously mentioned, we must be careful when mixing simulation
+domain shapes (where dimensions are normally specified in `[rows, columns]`
+order) and GPU 2D spaces (where dimensions are normally specified in `[width,
+height]` order).
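+
+As a quick sanity check, here is a worked example of that mapping. The numbers
+are hypothetical, picked for illustration only:
+
+```rust,ignore
+// A domain of 1080 rows x 1920 columns, processed with 8x8 work-groups:
+// rows/columns map to the GPU's [width, height] order in reverse, so...
+let (num_rows, num_cols) = (1080u32, 1920u32);
+let dispatch_size = [num_cols / 8, num_rows / 8, 1];
+assert_eq!(dispatch_size, [240, 135, 1]);
+```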
+
+After this, we'll start to record a command buffer, and also grab a copy of our
+compute pipeline's I/O layout for reasons that will become clear in the next
+step.
+
+```rust,ignore
+use vulkano::{
+    command_buffer::{AutoCommandBufferBuilder, CommandBufferUsage},
+    pipeline::Pipeline,
+};
+
+let mut builder = AutoCommandBufferBuilder::primary(
+    &context.command_alloc,
+    context.queue.queue_family_index(),
+    CommandBufferUsage::OneTimeSubmit,
+)?;
+let pipeline_layout = pipeline.layout().clone();
+```
+
+We will then bind our compute pipeline (which moves it away, hence the previous
+layout copy) and its parameters (which requires a copy of the pipeline layout).
+
+```rust,ignore
+use crate::gpu::pipeline::PARAMS_SET;
+use vulkano::pipeline::PipelineBindPoint;
+
+builder
+    .bind_pipeline_compute(pipeline)?
+    .bind_descriptor_sets(
+        PipelineBindPoint::Compute,
+        pipeline_layout.clone(),
+        PARAMS_SET,
+        parameters,
+    )?;
+```
+
+And then we will bind the current simulation inputs and outputs and schedule
+an execution of the compute pipeline with the previously computed number of
+work-groups...
+
+```rust,ignore
+use crate::gpu::pipeline::IMAGES_SET;
+
+builder
+    .bind_descriptor_sets(
+        PipelineBindPoint::Compute,
+        pipeline_layout,
+        IMAGES_SET,
+        concentrations,
+    )?
+    .dispatch(dispatch_size)?;
+```
+
+Finally, we will build the command buffer, submit it to the command queue,
+flush the command queue to the GPU, and wait for execution to finish.
+
+```rust,ignore
+use vulkano::{
+    command_buffer::PrimaryCommandBufferAbstract,
+    sync::GpuFuture,
+};
+
+builder
+    .build()?
+    .execute(context.queue.clone())?
+    .then_signal_fence_and_flush()?
+    .wait(None)?;
+```
+
+With this, we get an `update()` function whose semantics roughly match those of
+the CPU version, and which will thus be easiest to integrate into the existing
+code. But as you may have guessed while reading the previous code, some
+sacrifices had to be made to achieve this:
+
+- We are not leveraging Vulkan's command batching capabilities. On every
+  simulation step, we need to build a new command buffer that binds the
+  simulation compute pipeline and parameters set, only to execute a single
+  compute dispatch and wait for it. These overheads are especially bad because
+  they are paid once per simulation step, so scaling up the number of compute
+  steps per saved image will not amortize them.
+- Compared to an optimized alternative that would just bind the compute
+  descriptor set and schedule a compute pipeline dispatch, leaving the rest up
+  to higher-level code, our `update()` function needs to handle many unrelated
+  concerns, and gets a more complex signature and complicated calling procedure
+  as a result.
+
+Therefore, as soon as we get the simulation to run, this should be our first
+target for optimization.
+
+
+## Exercise
+
+At long last, we are ready to roll out a full Vulkan-based GPU simulation.
+
+Integrate the simulation update function discussed above at the root of the
+`gpu` module of the codebase, then write a new version of `run_simulation()` in
+the same module that uses all the infrastructure introduced so far to run the
+simulation on the GPU.
+
+Next, make the simulation use the `gpu` version of `run_simulation()`. To
+enable CPU/GPU comparisons, it is a good idea to keep it easy to run the CPU
+version. For example, you could have an off-by-default `cpu: bool` runner
+option which lets you switch back to CPU mode (see the sketch below).
+
+Once the simulation builds, make sure that it runs successfully and without
+significant[^5] Vulkan validation warnings. You can check the latter by running
+the simulation binary in debug mode (`cargo run --bin simulate` without the
+usual `--release` option). Beware that such a debug binary can be slow: you
+will want to run it with a smaller number of simulation steps.
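+
+Here is a minimal sketch of what that CPU/GPU switch could look like, assuming
+that `RunnerOptions` is a `clap`-derived struct as in the previous chapters.
+The field and module names are illustrative, adapt them to your codebase:
+
+```rust,ignore
+use clap::Parser;
+
+#[derive(Parser)]
+pub struct RunnerOptions {
+    // ...existing options: domain shape, number of steps, etc...
+
+    /// Run the simulation on the CPU instead of the GPU
+    #[arg(long)]
+    pub cpu: bool,
+}
+
+// Then, in the simulation binary...
+if options.cpu {
+    cpu::run_simulation(&options)?;
+} else {
+    gpu::run_simulation(&options)?;
+}
+```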
+
+And finally, once the simulation binary seems to run successfully, you will
+want to adjust the microbenchmark to evaluate its performance:
+
+- Add a new GPU update microbenchmark in addition to the existing CPU
+  microbenchmark, so that you can measure how the two implementations compare.
+- Add another GPU microbenchmark that measures the performance of getting the
+  data from the GPU to the CPU (via `Concentrations::current_v()`).
+
+Then run the resulting microbenchmark, and compare the performance
+characteristics of our first GPU version to the optimized CPU version.
+
+
+---
+
+[^1]: I'm purposely being vague about what a "visual component" is, since even
+      back in the day there were already two kinds of components with
+      configurable processing (triangle vertex and pixel/fragment), and the
+      list has grown enormously since: geometry primitives, tessellator inputs
+      and outputs, rays, meshlets...
+
+[^2]: ...albeit with bad performance because SIMD can't help leaking through
+      every nice abstraction and making everyone's life more complicated.
+
+[^3]: First came sub-groups/warps, which expose hardware SIMD instructions.
+      Then came NVidia's thread-block clusters, which to my knowledge do not
+      have a standardized Khronos name yet, but are basically about modeling
+      L2 cache shards shared by multiple GPU compute units.
+
+[^4]: It could be done with a simple `if` in the current version, but would
+      become much harder when introducing later optimizations that leverage
+      the communication capabilities of work-groups and subgroups.
+
+[^5]: As a minor spoiler, you will find that even `vulkano` gets some
+      subtleties of Vulkan wrong.
diff --git a/handouts/src/SUMMARY.md b/handouts/src/SUMMARY.md
index 28ec4f8db37d4f41ddf8aa75ea822dfe6fa20640..07a928f860678dbfddbb5c88703f2ab372c9a513 100644
--- a/handouts/src/SUMMARY.md
+++ b/handouts/src/SUMMARY.md
@@ -34,9 +34,8 @@
 - [Shader](22-shady-business.md)
 - [Pipeline](23-gazoduc.md)
 - [Resources](24-resourceful.md)
-- [First run]()
-- [Writing output]()
-- [Optimizing]()
+- [Execution](25-simulating.md)
+- [Optimizations]()
 
 ---