8-bit Data Processing

Getting good performance from CUDA with 8-bit per pixel image data can at first glance look awkward because of the memory access rules: at a minimum you should be reading and writing 32-bit values. With a bit of planning it is actually pretty simple.
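As a rough illustration of the idea, here is a minimal sketch in plain C++ of handling four 8-bit pixels per 32-bit word, which is the amount of work a single CUDA thread would take on. The names brighten4 and brightenRow are made up for this example, and the host-side loop only stands in for what would be one thread per word in a real kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: add 'offset' to each of the four 8-bit pixels packed
// in one 32-bit word, clamping each result to the 0..255 range.
inline std::uint32_t brighten4(std::uint32_t packed, int offset)
{
    std::uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        int pixel = static_cast<int>((packed >> (8 * i)) & 0xFFu);
        int value = std::min(255, std::max(0, pixel + offset));
        out |= static_cast<std::uint32_t>(value) << (8 * i);
    }
    return out;
}

// Hypothetical host-side stand-in for a kernel: each iteration does the work
// one thread would do, i.e. one 32-bit read and one 32-bit write.
inline void brightenRow(const std::uint32_t *src, std::uint32_t *dst,
                        std::size_t wordCount, int offset)
{
    for (std::size_t i = 0; i < wordCount; ++i)
        dst[i] = brighten4(src[i], offset);
}
```

The point of the sketch is only the access pattern: the 8-bit values are never read or written individually, they always travel in groups of four.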

Figuring all of this out has helped me get to grips with optimising CUDA code and understand the performance counters provided by the CUDA driver and tools. I’m going to put together a series of posts going over the various problems I’ve run into with 8-bit data and the solutions I’ve found.

CUDA Optimisation

Optimising code for CUDA can be tricky and the biggest challenge is getting to grips with the memory access rules.

Following the adage of “keep it simple, stupid”, I was working on a very simple function while trying to get to grips with the NVIDIA performance tools. I came across some results that seemed counterintuitive, but on closer examination of the PTX code it turned out the compiler was optimising in a way I wasn’t expecting. Something to bear in mind if you’re having trouble sorting out memory access problems.


Various YUV pixel formats come up frequently when working with both live video and movie files. YUV formats generally break down into two types – packed and planar. Packed is the easiest to work with as it’s the most similar to RGB in that the data for each of the colour channels is packed together to form pixels. In planar formats the different colour components are split up into separate buffers. There’s a good summary of the various formats over at fourcc.org.

This function extracts the luminance signal from the packed signal, which in this case is in UYVY format.
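The function itself is in the full post. As a stand-in, here is a minimal sketch in plain C++ of what extracting luminance from UYVY involves, assuming the usual UYVY byte order of U0 Y0 V0 Y1 for each pair of pixels. The name extractLumaUYVY and its parameters are hypothetical, and the loop is a host-side illustration rather than a CUDA kernel.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed UYVY byte layout: every 4 bytes hold two pixels as U0 Y0 V0 Y1.
// Hypothetical helper, not the function from the post: copies the two Y
// bytes of each pixel pair into a tightly packed 8-bit luminance buffer.
inline void extractLumaUYVY(const std::uint8_t *uyvy, std::uint8_t *luma,
                            std::size_t pixelCount)
{
    // pixelCount is assumed to be even, since UYVY stores pixels in pairs.
    for (std::size_t pair = 0; pair < pixelCount / 2; ++pair) {
        luma[2 * pair]     = uyvy[4 * pair + 1]; // Y0
        luma[2 * pair + 1] = uyvy[4 * pair + 3]; // Y1
    }
}
```

In a kernel, each 4-byte group would naturally be one 32-bit read per thread, yielding two luminance bytes.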

This code is written to operate on an HD frame, so 1920×1080 pixels in size, and runs with a block width of 32 threads. Each thread processes 2 UYVY pixels, so each thread block processes 64 pixels, and 1920 rather handily happens to be an exact multiple of 64 (1920 / 64 = 30), i.e. we can conveniently ignore boundary conditions here 🙂
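The arithmetic behind that can be written down as a few compile-time constants. This is only a sketch: the constant names and the firstPixelX helper are invented for illustration, and it assumes each block covers 64 consecutive pixels of a single row.

```cpp
constexpr int kFrameWidth      = 1920;
constexpr int kFrameHeight     = 1080;
constexpr int kThreadsPerBlock = 32;
constexpr int kPixelsPerThread = 2;   // one UYVY pair per thread

constexpr int kPixelsPerBlock = kThreadsPerBlock * kPixelsPerThread; // 64

// If this holds, no block straddles the end of a row and no bounds checks
// are needed inside the kernel.
static_assert(kFrameWidth % kPixelsPerBlock == 0,
              "a row must split into a whole number of blocks");

constexpr int kBlocksPerRow = kFrameWidth / kPixelsPerBlock; // 30

// Hypothetical index helper: x coordinate of the first of the two pixels
// handled by a given thread, mirroring blockIdx.x and threadIdx.x in CUDA.
constexpr int firstPixelX(int blockIndexX, int threadIndexX)
{
    return (blockIndexX * kThreadsPerBlock + threadIndexX) * kPixelsPerThread;
}
```

The last thread of the last block in a row starts at pixel 1918 and so finishes exactly on pixel 1919, the end of the row.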


Courtesy of the OpenCV 2.3 GPU code comes a neat snippet that uses a template parameter to read RGB or BGR ordered components when dealing with RGB triplets.

The Code

template <int blueIndex>
float rgb2grey(const float *src)
{
   // blueIndex is 0 for BGR or 2 for RGB; blueIndex^2 is then the red index.
   return 0.114f*src[blueIndex] + 0.587f*src[1] + 0.299f*src[blueIndex^2];
}

Then to use the function you simply supply the index where the blue value resides to take care of the RGB vs BGR ordering.

For RGB ordering:

float g = rgb2grey<2>(src);

And for BGR ordering:

float g = rgb2grey<0>(src);

This only works for swapping R and B around, because XOR with 2 maps index 0 to 2 and index 2 to 0. It won’t work for more weird and wonderful orderings.
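To make the restriction concrete, here is a small self-contained sketch. The function name greyFromTriplet is made up for this example and is not the OpenCV code. It applies the blue weight 0.114 at blueIndex and the red weight 0.299 at blueIndex ^ 2, and rejects any other blue position at compile time.

```cpp
// XOR with 2 flips between 0 and 2, which is the whole trick.
static_assert((0 ^ 2) == 2, "blue at 0 puts red at 2 (BGR)");
static_assert((2 ^ 2) == 0, "blue at 2 puts red at 0 (RGB)");

// Hypothetical variant for illustration only.
template <int blueIndex>
inline float greyFromTriplet(const float *src)
{
    // Any other value (e.g. 1) would make blueIndex ^ 2 point outside the
    // triplet or at the wrong channel, so reject it when compiling.
    static_assert(blueIndex == 0 || blueIndex == 2,
                  "blue must be the first or last component");
    return 0.114f * src[blueIndex]
         + 0.587f * src[1]
         + 0.299f * src[blueIndex ^ 2];
}
```

With this, a pure red pixel gives the same grey value whether it is stored as RGB and read with greyFromTriplet<2>, or stored as BGR and read with greyFromTriplet<0>.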

The original OpenCV code can be found in modules/gpu/src/opencv2/gpu/device/detail/color.hpp.

Templates and CUDA

For me the fact that you can use template meta-programming is a real plus point of CUDA. It allows for good code re-use and the template expansion gives good scope for the compiler to optimise. It can also allow you to remove conditionals from kernels in appropriate circumstances – more in a future post!
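One way removing a conditional could look, as a minimal sketch in plain C++ and not necessarily the approach the future post takes: the option becomes a bool template parameter, so the decision is made once outside the loop (or kernel launch) rather than per pixel. The names processPixel, processRow and processRowDispatch are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// 'invert' is a compile-time constant, so each instantiation contains only
// one of the two code paths and the compiler can drop the other.
template <bool invert>
inline std::uint8_t processPixel(std::uint8_t p)
{
    return invert ? static_cast<std::uint8_t>(255 - p) : p;
}

template <bool invert>
inline void processRow(const std::uint8_t *src, std::uint8_t *dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = processPixel<invert>(src[i]);
}

// The runtime flag is tested once here, not once per pixel.
inline void processRowDispatch(const std::uint8_t *src, std::uint8_t *dst,
                               std::size_t n, bool invert)
{
    if (invert)
        processRow<true>(src, dst, n);
    else
        processRow<false>(src, dst, n);
}
```

In CUDA the same shape would apply to a templated kernel, with the host choosing which instantiation to launch.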