8-bit Data Processing

Getting good performance from CUDA with 8-bit per pixel image data can at first glance seem awkward because of the memory access rules – at a minimum you should be reading and writing 32-bit values. With a bit of planning it is actually pretty simple.
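To make the "32-bit values" idea concrete, here is a rough host-side sketch in plain C++ rather than a real kernel. The function name `invert8` and the choice of an invert operation are made up for illustration; the point is only that four 8-bit pixels travel through each 32-bit read and write. It assumes the pixel count is a multiple of 4.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical example: invert 8-bit pixels four at a time by moving
// them through 32-bit words. Assumes count is a multiple of 4.
void invert8(const std::uint8_t* src, std::uint8_t* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, src + i, sizeof(word));  // one 32-bit read = 4 pixels
        word = ~word;                               // invert all four bytes at once
        std::memcpy(dst + i, &word, sizeof(word));  // one 32-bit write = 4 pixels
    }
}
```

In a kernel the equivalent would be each thread loading one 32-bit word (for example as a `uchar4`) instead of a single byte.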

Figuring all of this out has helped me get to grips with optimising CUDA code and understand the performance counters provided by the CUDA driver and tools. I’m going to put together a series of posts going over the various problems and solutions I’ve found to dealing with 8-bit data.

CUDA Optimisation

Optimising code for CUDA can be tricky and the biggest challenge is getting to grips with the memory access rules.

Following the adage of “keep it simple, stupid”, I was working on a very simple function while trying to get to grips with the NVIDIA performance tools. I came across some results that seemed counterintuitive, but on closer examination of the PTX code it turned out the compiler was optimising in a way I wasn’t expecting. Something to bear in mind if you’re having trouble sorting out memory access problems.

Background

Various YUV pixel formats come up frequently when working with both live video and movie files. YUV formats generally break down into two types – packed and planar. Packed is the easiest to work with as it’s the most similar to RGB in that the data for each of the colour channels is packed together to form pixels. In planar formats the different colour components are split up into separate buffers. There’s a good summary of the various formats over at fourcc.org.

This function extracts the luminance signal from the packed signal, which in this case is in UYVY format.

This code is written to operate on an HD frame, so 1920×1080 pixels in size, and runs with a block width of 32 threads. Each thread processes 2 UYVY pixels, so each thread block processes 64 pixels, which rather handily divides 1920 exactly (30 blocks per row), i.e. we can conveniently ignore boundary conditions here 🙂
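The arithmetic and the byte layout can be sketched in plain C++. This is not the kernel from the post, just a host-side illustration under the assumptions stated above; `extractLuma` and the constant names are hypothetical. In UYVY each pair of pixels is stored as the four bytes U, Y0, V, Y1, so one 32-bit word holds exactly the two pixels a thread handles.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kWidth = 1920;           // HD frame width in pixels
constexpr int kThreadsPerBlock = 32;   // block width from the text
constexpr int kPixelsPerThread = 2;    // one UYVY word = 2 pixels
constexpr int kPixelsPerBlock = kThreadsPerBlock * kPixelsPerThread;  // 64
static_assert(kWidth % kPixelsPerBlock == 0, "row must split into whole blocks");
constexpr int kBlocksPerRow = kWidth / kPixelsPerBlock;               // 30

// Copy the two Y bytes out of every U, Y0, V, Y1 group.
std::vector<std::uint8_t> extractLuma(const std::vector<std::uint8_t>& uyvy)
{
    std::vector<std::uint8_t> luma(uyvy.size() / 2);
    for (std::size_t pair = 0; pair < uyvy.size() / 4; ++pair) {
        luma[2 * pair]     = uyvy[4 * pair + 1];  // Y0
        luma[2 * pair + 1] = uyvy[4 * pair + 3];  // Y1
    }
    return luma;
}
```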


Courtesy of the OpenCV 2.3 GPU code comes a neat snippet of code for using a template parameter for reading RGB or BGR ordered components when dealing with RGB triplets.

The Code

  template <int blueIndex>
  float rgb2grey(const float *src)
  {
     // Blue sits at blueIndex and red at blueIndex^2 (0 <-> 2); green is always at index 1.
     return 0.114f*src[blueIndex] + 0.587f*src[1] + 0.299f*src[blueIndex^2];
  }

Then to use the function you simply supply the index where the blue value resides to take care of the RGB vs BGR ordering.

For RGB ordering:

float g = rgb2grey<2>(src);

And for BGR ordering:

float g = rgb2grey<0>(src);

This only works for swapping R and B around and won’t work for more weird and wonderful orderings.
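The reason the trick is limited to R/B swaps is the XOR: `0 ^ 2` is 2 and `2 ^ 2` is 0, so the red index is always the "other end" of the triplet from blue. A small variant of the snippet (my own sketch, not the OpenCV code; the name `rgb2greyChecked` is made up) can make the compiler reject anything else:

```cpp
// Hypothetical variant: same weights, plus a compile-time check that
// blue is at index 0 or 2, the only cases where blueIndex ^ 2 finds red.
template <int blueIndex>
float rgb2greyChecked(const float* src)
{
    static_assert(blueIndex == 0 || blueIndex == 2,
                  "blue must be the first or last component");
    return 0.114f * src[blueIndex]        // blue
         + 0.587f * src[1]                // green
         + 0.299f * src[blueIndex ^ 2];   // red
}
```

With this, `rgb2greyChecked<1>(src)` fails to compile instead of silently producing the wrong grey value.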

The original OpenCV code can be found in modules/gpu/src/opencv2/gpu/device/detail/color.hpp.

Templates and CUDA

For me the fact that you can use template meta-programming is a real plus point of CUDA. It allows for good code re-use and the template expansion gives good scope for the compiler to optimise. It can also allow you to remove conditionals from kernels in appropriate circumstances – more in a future post!
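As a taste of the conditional-removal idea, here is a minimal sketch in plain C++ (it is my own illustration, not code from a later post, and `scalePixels` is a made-up name). The flag becomes a template parameter, so inside each instantiation the test is a compile-time constant that the compiler is free to drop; the runtime decision is made once, outside the per-pixel work.

```cpp
#include <cstddef>

// The clamp test is a compile-time constant in each instantiation,
// so no per-pixel runtime branch on the flag is needed.
template <bool clampOutput>
void scalePixels(const float* src, float* dst, std::size_t count, float gain)
{
    for (std::size_t i = 0; i < count; ++i) {
        float v = src[i] * gain;
        if (clampOutput) {
            if (v > 1.0f) v = 1.0f;
        }
        dst[i] = v;
    }
}

// Runtime dispatch: pick the instantiation once per call.
void scalePixels(const float* src, float* dst, std::size_t count,
                 float gain, bool clampOutput)
{
    if (clampOutput) scalePixels<true>(src, dst, count, gain);
    else             scalePixels<false>(src, dst, count, gain);
}
```

The same pattern carries over to a kernel: the host picks which template instantiation to launch.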

Subversion is a bit lacking on the merging and branching front in comparison with some of the newer distributed version control systems, but it does make working with large projects easy. The single biggest reason for this is sparse checkouts.

At work we have a source tree that contains artwork, documentation, the source for all the third party libraries we use and compiled versions of these for multiple architectures and platforms (32-bit, 64-bit, Windows, RHEL4, RHEL5 etc). This makes the full checkout rather large and there are many times when you only need a small fraction of all the files.

The problem with sparse checkouts is that setting one up manually can become very laborious for anything beyond a few directories. For more details on the basics see here. As with many tedious tasks – computers can help!

I put together a little script for helping with this – get it here. I hope it’s useful and keep reading for more details on how it works.

In Action

The final version of the script makes doing a sparse checkout as simple as doing a standard checkout:
./checkout.rb svn://server/trunk

To checkout using a named subset of files:
./checkout.rb --map documentation svn://server/trunk

To checkout using a locally defined subset of files (rather than a subset stored on the server):
./checkout.rb --map local.yaml svn://server/trunk


At work we are getting a 64-bit version of our software up and running at the moment. Most of the usual culprits reared their heads – assuming that a pointer and an integer had the same size, etc.

One more interesting one, which I’ve not come across before, is related to using STL string::find and the special constant string::npos. A quick search on Google shows this is not unique to our code base, and it actually just boils down to data being truncated before a comparison. The nuances of the problem do lead on to a discussion about signed vs. unsigned integral types in C++ and the handling of comparisons between differently sized data types. I thought it was worth looking at a bit further, and it is definitely something to watch out for when doing code reviews.

It could also make for a particularly challenging interview question 😉
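For a flavour of the problem, here is one common form it takes. This is a sketch under my own assumptions rather than the code from our code base, and the function names are made up. The result of `find` is squeezed into a 32-bit variable, so on a 64-bit build the all-ones `npos` is truncated and no longer compares equal to the real `npos`.

```cpp
#include <cstdint>
#include <string>

// Buggy on 64-bit: the result of find() is truncated to 32 bits, so a
// "not found" result never equals the 64-bit std::string::npos.
bool containsBuggy(const std::string& haystack, const std::string& needle)
{
    std::uint32_t pos = static_cast<std::uint32_t>(haystack.find(needle));
    return pos != std::string::npos;
}

// Fixed: keep the result in the type find() actually returns.
bool containsFixed(const std::string& haystack, const std::string& needle)
{
    std::string::size_type pos = haystack.find(needle);
    return pos != std::string::npos;
}
```

Where `size_t` is 32 bits wide both versions agree, which is exactly why this sort of bug only surfaces during a 64-bit port.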
