8-bit Data Processing

Getting good performance from CUDA with 8-bit-per-pixel image data can seem awkward at first glance because of the memory access rules - at a minimum you should be reading and writing 32-bit values. With a little planning, though, it is actually pretty simple.

Figuring all of this out has helped me get to grips with optimising CUDA code and with understanding the performance counters provided by the CUDA driver and tools. I'm going to put together a series of posts covering the various problems and solutions I've found when dealing with 8-bit data.

CUDA Optimisation

Optimising code for CUDA can be tricky, and the biggest challenge is getting to grips with the memory access rules.

Following the "keep it simple, stupid" adage, I was working on a very simple function to familiarise myself with the NVIDIA performance tools. I came across some results that seemed counter-intuitive, but on closer examination of the PTX code it turned out the compiler was optimising in a way I wasn't expecting. Something to bear in mind if you're having trouble tracking down memory access problems.

Background

Various YUV pixel formats come up frequently when working with both live video and movie files. YUV formats generally break down into two types - packed and planar. Packed is the easiest to work with as it's the most similar to RGB in that the data for each of the colour channels is packed together to form pixels. In planar formats the different colour components are split up into separate buffers. There's a good summary of the various formats over at fourcc.org.

This function extracts the luminance (Y) channel from the packed signal, which in this case is in UYVY format.

This code is written to operate on an HD frame, 1920x1080 pixels in size, and runs with a block width of 32 threads. Each thread reads two 4-byte UYVY words and writes four luma bytes, so each thread block covers 128 pixels, which rather handily divides 1920 exactly, i.e. we can conveniently ignore boundary conditions here 🙂

Version 1

The Code

The first version of the code was as follows:

__global__ void uyvy2y_v0(unsigned char *output,size_t oPitch,const unsigned char *input,size_t width,size_t iPitch,size_t height)
{
        const int ix =  threadIdx.x;
        const int iy =  threadIdx.y;
        const int x =   blockDim.x*blockIdx.x + ix;
        const int y =   blockDim.y*blockIdx.y + iy;
 
        //pointer to the input pixel at this thread
        const uchar4 *ip = (const uchar4 *)&input[y*iPitch + x*4*2];
        uchar4 *op = (uchar4 *)&output[y*oPitch + x*4];
 
        uchar4 uyvy0 = ip[0];
        uchar4 uyvy1 = ip[1];
 
        uchar4 yVals = {uyvy0.y,uyvy0.w,uyvy1.y,uyvy1.w};
        *op = yVals;
}

Performance

This code reads and writes 32-bit values, so in theory it should be fairly efficient in terms of global reads/writes. The performance counters, however, tell a different story: we're only getting 12.5% Global Load Efficiency when compiled for SM 1.x.

Analysis

Looking at the PTX code generated for SM 1.x, there are several 8-bit load instructions (ld.global.u8):

        .reg .u16 %rh<6>;
        .reg .u32 %r<14>;
        .reg .u64 %rd<15>;
        .loc    14      1       0
$LDWbegin__Z9uyvy2y_v0PhmPKhmmm:
        .loc    14      12      0
        mov.u16         %rh1, %ctaid.x;
        mov.u16         %rh2, %ntid.x;
        mul.wide.u16    %r1, %rh1, %rh2;
        cvt.s32.u16     %r2, %tid.x;
        add.u32         %r3, %r2, %r1;
        cvt.s32.u16     %r4, %tid.y;
        mov.u16         %rh3, %ctaid.y;
        mov.u16         %rh4, %ntid.y;
        mul.wide.u16    %r5, %rh3, %rh4;
        add.u32         %r6, %r4, %r5;
        cvt.s64.s32     %rd1, %r6;
        ld.param.u64    %rd2, [__cudaparm__Z9uyvy2y_v0PhmPKhmmm_input];
        mul.lo.s32      %r7, %r3, 8;
        cvt.s64.s32     %rd3, %r7;
        ld.param.u64    %rd4, [__cudaparm__Z9uyvy2y_v0PhmPKhmmm_iPitch];
        mul.lo.u64      %rd5, %rd1, %rd4;
        add.u64         %rd6, %rd3, %rd5;
        add.u64         %rd7, %rd2, %rd6;
        ld.global.u8    %r8, [%rd7+3];
        .loc    14      13      0
        ld.global.u8    %r9, [%rd7+5];
        ld.global.u8    %r10, [%rd7+7];
        .loc    14      16      0
        ld.param.u64    %rd8, [__cudaparm__Z9uyvy2y_v0PhmPKhmmm_output];
        mul.lo.s32      %r11, %r3, 4;
        cvt.s64.s32     %rd9, %r11;
        ld.param.u64    %rd10, [__cudaparm__Z9uyvy2y_v0PhmPKhmmm_oPitch];
        mul.lo.u64      %rd11, %rd1, %rd10;
        add.u64         %rd12, %rd9, %rd11;
        add.u64         %rd13, %rd8, %rd12;
        ld.global.u8    %r12, [%rd7+1];
        st.global.v4.u8         [%rd13+0], {%r12,%r8,%r9,%r10};
        .loc    14      17      0
        exit;
$LDWend__Z9uyvy2y_v0PhmPKhmmm:

For SM 2.x we get 25% Global Load Efficiency - an improvement that is reflected in the timings, but still not good enough. The SM 2.x PTX code is fairly different, the main change being the use of ld.global.v4.u8.

46
        .reg .s32       %r<15>;
        .reg .s64       %rl<15>;
        .reg .s16       %rc<29>;
 
        ld.param.u64    %rl1, [_Z9uyvy2y_v0PhmPKhmmm_param_0];
        ld.param.u64    %rl2, [_Z9uyvy2y_v0PhmPKhmmm_param_1];
        ld.param.u64    %rl3, [_Z9uyvy2y_v0PhmPKhmmm_param_2];
        ld.param.u64    %rl4, [_Z9uyvy2y_v0PhmPKhmmm_param_4];
        cvta.to.global.u64      %rl5, %rl1;
        cvta.to.global.u64      %rl6, %rl3;
        .loc 2 5 1
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        .loc 2 3 1
        mov.u32         %r3, %tid.x;
        .loc 2 5 1
        mad.lo.s32      %r4, %r1, %r2, %r3;
        .loc 2 6 1
        mov.u32         %r5, %ctaid.y;
        mov.u32         %r6, %ntid.y;
        .loc 2 4 1
        mov.u32         %r7, %tid.y;
        .loc 2 6 1
        mad.lo.s32      %r8, %r5, %r6, %r7;
        .loc 2 9 1
        cvt.s64.s32     %rl7, %r8;
        shl.b32         %r9, %r4, 3;
        .loc 2 9 1
        cvt.s64.s32     %rl8, %r9;
        mad.lo.s64      %rl9, %rl7, %rl4, %rl8;
        add.s64         %rl10, %rl6, %rl9;
        shl.b32         %r10, %r4, 2;
        .loc 2 10 1
        cvt.s64.s32     %rl11, %r10;
        mad.lo.s64      %rl12, %rl7, %rl2, %rl11;
        add.s64         %rl13, %rl5, %rl12;
        add.s64         %rl14, %rl10, 4;
        .loc 2 12 1
        ld.global.v4.u8         {%rc5, %rc6, %rc7, %rc8}, [%rl10];
        .loc 2 13 1
        ld.global.u8    %rc1, [%rl10+7];
        ld.global.u8    %rc4, [%rl14+1];
        .loc 2 16 1
        st.global.v4.u8         [%rl13], {%rc6, %rc8, %rc4, %rc1};
        .loc 2 17 2
        ret;

Version 2

Working from the premise that accessing any individual component of the uchar4 leads the compiler to issue 8-bit loads, the next version of the code loads the memory as 32-bit ints and uses bit shifting and masking to re-pack the data.

The Code

__global__ void uyvy2y_v1(unsigned char *output,size_t oPitch,const unsigned char *input,size_t width,size_t iPitch,size_t height)
{
        const int ix =  threadIdx.x;
        const int iy =  threadIdx.y;
        const int x =   blockDim.x*blockIdx.x + ix;
        const int y =   blockDim.y*blockIdx.y + iy;
 
        //pointer to the input pixel at this thread
        const unsigned int *ip = (const unsigned int *)&input[y*iPitch + x*4*2];
        unsigned int *op = (unsigned int *)&output[y*oPitch + x*4];
 
        unsigned int uyvy0 = ip[0];
        unsigned int uyvy1 = ip[1];
 
        //pack the four luma samples lowest byte first, matching the uchar4 store in version 1
        unsigned int out = ((uyvy0>>8)&0xff) | (((uyvy0>>24)&0xff)<<8) | (((uyvy1>>8)&0xff)<<16) | (((uyvy1>>24)&0xff)<<24);
        *op = out;
}

Performance

The Global Memory Load Efficiency counter is now hitting 50%, which is reflected in the improved performance. The other change we can make is to do the initial load as a single 64-bit value:

        const uint64_t *ip = (const uint64_t *)&input[y*iPitch + x*4*2];
        ...
        uint64_t uyvy = ip[0];
        unsigned int uyvy0 = (uyvy&0xffffffff);
        unsigned int uyvy1 = ((uyvy>>32)&0xffffffff);

With this change we hit 100% Global Load Efficiency! It gives almost no measurable boost in performance, but it does get the counter to 100%.

Conclusions

Whilst writing up this post I re-benchmarked the code on some newer hardware (GTX 485M - Shader Model 2.1), and the performance difference between the two versions of the function is actually not as pronounced. The difference appears to be mainly down to the instruction set rather than the hardware itself: compiling for Shader Model 1.3, there is still a performance hit even on newer hardware.
