8-bit Data Processing

Getting good performance from CUDA with 8-bit-per-pixel image data can look awkward at first glance because of the memory access rules: at a minimum, each thread should be reading and writing 32-bit values. With a bit of planning it is actually pretty simple.
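As a rough illustration of the 32-bit rule, here is a minimal sketch in which each thread handles four 8-bit pixels at once by reading and writing a single uchar4. The names invert4 and invert8 and the inversion itself are made up for this example. It also assumes the buffers come from cudaMalloc (so they are suitably aligned) and that the pixel count is a multiple of 4.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel operation, chosen only for illustration:
// invert four 8-bit pixels packed into one 32-bit uchar4.
__host__ __device__ inline uchar4 invert4(uchar4 p)
{
    return make_uchar4(static_cast<unsigned char>(255 - p.x),
                       static_cast<unsigned char>(255 - p.y),
                       static_cast<unsigned char>(255 - p.z),
                       static_cast<unsigned char>(255 - p.w));
}

// Each thread performs one 32-bit load and one 32-bit store.
// numQuads is the number of 4-pixel groups (assumed: pixel count / 4, no remainder).
__global__ void invert8(const uchar4* __restrict__ src,
                        uchar4* __restrict__ dst,
                        int numQuads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numQuads)
        dst[i] = invert4(src[i]);
}
```

The point is only the access width: the same 8-bit work is done, but every memory transaction issued by a thread is 32 bits wide rather than 8.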

Figuring all of this out has helped me get to grips with optimising CUDA code and understand the performance counters provided by the CUDA driver and tools. I’m going to put together a series of posts going over the various problems I’ve hit when dealing with 8-bit data and the solutions I’ve found.

CUDA Optimisation

Optimising code for CUDA can be tricky and the biggest challenge is getting to grips with the memory access rules.

Following the adage of “keep it simple, stupid”, I was working on a very simple function while trying to get to grips with the NVIDIA performance tools. I came across some results that seemed counterintuitive, but on closer examination of the PTX code it turned out the compiler was optimising in a way I wasn’t expecting. That’s something to bear in mind if you’re having trouble sorting out memory access problems.

Background

Various YUV pixel formats come up frequently when working with both live video and movie files. YUV formats generally break down into two types – packed and planar. Packed is the easiest to work with as it’s the most similar to RGB in that the data for each of the colour channels is packed together to form pixels. In planar formats the different colour components are split up into separate buffers. There’s a good summary of the various formats over at fourcc.org.

This function extracts the luminance signal from the packed signal, which in this case is in UYVY format.

This code is written to operate on an HD frame, so 1920×1080 pixels in size, and runs with a block width of 32 threads. Each thread processes 2 UYVY pixels, so each thread block processes 64 pixels, which rather handily divides 1920 exactly (30 blocks per row), i.e. we can conveniently ignore boundary conditions here 🙂
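To make the layout concrete, here is a minimal sketch of a kernel along the lines described above. It is not necessarily the code from the original post. The names lumaFromUYVY, extractLuma and launchExtractLuma are made up for this example. It also assumes a block height of 1, rows with no padding, and the usual UYVY byte order of U0 Y0 V0 Y1 for each pair of pixels.

```cuda
#include <cuda_runtime.h>

// UYVY packs two pixels into four bytes: U0 Y0 V0 Y1.
// Read as a uchar4, the two luma samples sit in .y and .w.
__host__ __device__ inline uchar2 lumaFromUYVY(uchar4 uyvy)
{
    return make_uchar2(uyvy.y, uyvy.w);
}

// One thread per pixel pair: a 32-bit load of UYVY, a 16-bit store of luma.
// No bounds check, because the grid is assumed to tile the frame exactly.
__global__ void extractLuma(const uchar4* __restrict__ uyvy,
                            uchar2* __restrict__ luma,
                            int pairsPerRow)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel-pair index within the row
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int i = y * pairsPerRow + x;
    luma[i] = lumaFromUYVY(uyvy[i]);
}

// Hypothetical launch helper for a 1920x1080 frame (assumes unpadded device buffers).
inline void launchExtractLuma(const uchar4* d_uyvy, uchar2* d_luma)
{
    const int width = 1920, height = 1080;
    dim3 block(32, 1);                         // 32 threads => 64 pixels per block
    dim3 grid(width / (2 * block.x), height);  // 30 blocks per row, one block row per line
    extractLuma<<<grid, block>>>(d_uyvy, d_luma, width / 2);
}
```

Note that each thread here reads 32 bits but writes only 16, so the store side falls short of the 32-bit minimum mentioned earlier. Treat it as a simple starting point rather than a tuned version.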
