Defining Data Decomposition across CUDA, OpenCL and Metal
When solving computational problems on accelerators (for example, graphics processing units), a central challenge is to decompose the computation into many small, identical sub-problems. These sub-problems are then mapped to the execution units of a given accelerator and solved in parallel.
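As a minimal sketch of this mapping, the CUDA kernel below assigns one sub-problem (a single element-wise addition) to each thread; the kernel name and its parameters are illustrative and not part of any particular framework's API.

```cuda
// Minimal sketch of one thread per sub-problem (CUDA): each thread adds a
// single pair of elements. The kernel name and parameters are illustrative.
__global__ void vector_add(const float* a, const float* b, float* c, int n)
{
    // The global thread index selects the sub-problem this thread solves.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)             // guard threads that fall outside the problem size
        c[i] = a[i] + b[i];
}
```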
The three most widely used low-level accelerator frameworks are CUDA, OpenCL and Metal. Besides mapping sub-problems to execution units, they also allow sub-problems to be grouped together. Groups of sub-problems have two interesting properties:
- they can be synchronized and
- they can share data.
Accelerator frameworks thus ask developers to define two layers of data decomposition: (1) the overall size of the problem space and (2) the size of a group of sub-problems.
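The sketch below expresses both layers and both group properties in CUDA terms: the launch configuration defines the overall problem space and the size of a group (a block), and threads within a group share a `__shared__` tile after a `__syncthreads()` barrier. All names are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Sketch in CUDA terms: threads within a block (the group layer) can share
// data through __shared__ memory and synchronize with __syncthreads().
// The kernel reverses each block-sized chunk in place.
__global__ void reverse_in_blocks(float* data)
{
    __shared__ float tile[256];                    // data shared by the group
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global sub-problem index
    tile[threadIdx.x] = data[i];
    __syncthreads();                               // group-wide barrier
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    const int n = 1 << 20;    // layer 1: overall size of the problem space
    const int block = 256;    // layer 2: size of a group of sub-problems
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // CUDA expects the number of groups (the grid), not the overall size,
    // so the kernel is launched with n / block groups of block threads each.
    reverse_in_blocks<<<n / block, block>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```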
Moving a problem from one accelerator framework to another, or implementing a solution with multiple accelerator frameworks, can be challenging. The frameworks are very similar, but the devil is in the details. There is no standard for mapping sub-problems to execution units; both the naming conventions and the semantics differ:
| | CUDA | OpenCL | Metal | Aura |
|---|---|---|---|---|
| level 1 | grid | global work | threads per group | mesh |
| level 2 | block | local work | thread groups | bundle |
| overall | grid * block | global work | threads per group * thread groups | mesh * bundle |
I added to this table the naming convention and semantics of the Aura library, which is under development. The library wraps the three standard accelerator frameworks and exposes a single API for all of them.
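To make the difference in semantics concrete, the host-side sketch below works out the launch parameters for a hypothetical problem of one million elements with groups of 256: CUDA (and Metal's `dispatchThreadgroups`) are given the number of groups, whereas OpenCL's `clEnqueueNDRangeKernel` is given the overall global work size.

```cuda
#include <cstdio>

int main()
{
    const size_t total  = 1000000;                      // overall problem size
    const size_t group  = 256;                          // size of one group
    const size_t groups = (total + group - 1) / group;  // groups, rounded up

    // CUDA (and Metal's dispatchThreadgroups) take the number of groups plus
    // the group size; the overall size becomes grid * block = 3907 * 256.
    std::printf("CUDA:   grid=%zu block=%zu overall=%zu\n",
                groups, group, groups * group);

    // OpenCL's clEnqueueNDRangeKernel takes the overall size (global work)
    // and the group size (local work) directly; in OpenCL 1.x the global work
    // size must be a multiple of the local work size, so it is padded to the
    // same 1000192 and the kernel guards against the extra work-items.
    std::printf("OpenCL: global=%zu local=%zu\n", groups * group, group);
    return 0;
}
```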