At work we are getting a 64-bit version of our software up and running at the moment. Most of the usual culprits reared their heads, such as assuming that a pointer and an integer have the same size.
One more interesting one, which I’ve not come across before, is related to using the STL’s string::find and the special constant string::npos. This is not unique to our code base (a quick google turns up plenty of other examples) and actually just boils down to data being truncated before a comparison. The nuances of the problem do lead on to a discussion about signed vs. unsigned integral types in C++ and the handling of comparisons between differently sized data types. I thought it was worth looking at a bit further, and it is definitely something to watch out for when doing code reviews.
It could also make for a particularly challenging interview question 😉
string::npos
The following code snippet exhibits the problem:
#include <iostream>
#include <string>
using namespace std;

int main(void)
{
    string a = "hello";
    // BUG: find returns a size_t, which is truncated to 32 bits here on a 64-bit build
    unsigned int pos = a.find("foo");
    if(pos == string::npos)
        cout << "foo not found" << endl;
    else
        cout << "foo found" << endl;
    return 0;
}
If you compile (g++ -m32 -o test test.cpp) and run that code at 32-bit it will output:
$./test
foo not found
If you compile (g++ -m64 -o test test.cpp) and run that code at 64-bit it will output:
$./test
foo found
Certainly not what was expected! gcc will actually warn you to expect a problem:
warning: comparison is always false due to limited range of data type
What is going on?
The problem comes from the fact that string::npos is of type size_t and not of type unsigned int. If you do the correct thing and change the type of pos to size_t then the code works as expected.
If you read the documentation for string::npos, it contains two pieces of information relevant to this problem:
npos is a static member constant value with the greatest possible value for an element of type size_t.
and
This constant is actually defined with a value of -1 (for any trait), which because size_t is an unsigned integral type, becomes the largest possible representable value for this type.
From the definition of string::npos we can see that its value will be dependent upon sizeof(size_t). So what do we see at 32-bit and 64-bit?
- At 32-bit sizeof(size_t)==4 and hence npos has a value of 0xFFFFFFFF
- At 64-bit sizeof(size_t)==8 and hence npos has a value of 0xFFFFFFFFFFFFFFFF
From this point onwards it should hopefully be fairly obvious why our comparison in the code above is misbehaving. Putting the return value of string::find into an unsigned int causes it to be truncated to 4 bytes when compiled at 64-bit, which then causes the subsequent comparison to fail.
When comparing two integral types of differing sizes the smaller one is promoted up to match the larger, so the following happens:
- The 32-bit unsigned integer is converted to a 64-bit unsigned integer by padding with zeros
- The 32-bit value of 0xFFFFFFFF becomes 0x00000000FFFFFFFF
- 0x00000000FFFFFFFF is then compared to 0xFFFFFFFFFFFFFFFF, which fails
signed, unsigned conversions and promotion
If pos had been declared as an int rather than an unsigned int then this code actually behaves as expected, albeit whilst generating a compiler warning (more on that later). The reason for this requires digging down into a few details of the C++ language spec.
Two factors come into play when pos is an int:
- We are comparing data types of a different size
- We are comparing a signed value against an unsigned value
A similar process to the one outlined above happens, but the crucial difference is the fact that we are now working with a signed type rather than an unsigned type. The following happens when using a signed integer:
- The 32-bit signed integer is converted to a 64-bit signed integer with the sign bit extended
- Hence the 32-bit value of 0xFFFFFFFF becomes 0xFFFFFFFFFFFFFFFF
- The 64-bit signed value is converted to unsigned, which doesn’t actually change the bit pattern
- The comparison now behaves as expected
So the crucial difference here is how signed and unsigned values are treated:
- When an unsigned value is promoted up to a larger data type it is zero padded
- When a signed value is promoted up the sign bit is extended out if it is set
The other time the difference between signed and unsigned values becomes important is when doing right bit shift operations: the same rules apply as to what the bits on the left hand side get set to. See Arithmetic Shift vs. Logical Shift for more details.
Conclusions
This leads onto an interesting point and one which I suspect explains how the code ended up the way it was. For people who aren’t aware of the fact that string::find returns data of type size_t rather than an int, I imagine the following situation occurs:
- Write the code to store the value in a (signed) int
- The compiler issues a warning about signed/unsigned comparisons
- The code is changed to use an unsigned int and the warning goes away
It just goes to show that assumptions about data types can really come back to bite you. I guess the main thing to say about all of this is that compiler warnings are there for a reason. They are your friend: understand what they are complaining about and fix it! And fix it properly; don’t just do something kludgy to silence the warning.
Thanks for the tip! This saved my life in a big school project. I didn’t even know that size_t existed as a separate type…
Glad to help! It struck me as something worth writing about – mainly given the strange behaviour you can accidentally trigger.
I think that in newer STL implementations it should be referred to as string::size_type and not size_t:
http://www.cplusplus.com/reference/string/string/find/
look for the “Basic template member declarations” at the end:
size_type find ( const basic_string& str, size_type pos = 0 ) const;
Yes indeed, thank you for drawing my attention to that.
This discussion on stackoverflow (http://stackoverflow.com/questions/918567/size-t-vs-containersize-type) provides a good summary of size_t versus size_type.
Actually there is a proper way to do this for both 32 and 64 bit, and that is to use the correct std::string::size_type. An example would be:
#include <iostream>
#include <string>
using namespace std;
int main(void)
{
std::string a = "hello";
std::string::size_type pos = 0;
// Now pos handles the correct sign and length for the platform
pos = a.find("foo");
if(pos == std::string::npos)
cout << "foo not found" << endl;
else
cout << "foo found" << endl;
return 0;
}
The issue is that std::string::npos is defined as -1, and because its type is unsigned, -1 wraps around to the maximum value of that type. A comparison with a narrower unsigned pos will not match properly, because unsigned types cannot have negative values: when pos is widened it is padded with zeros and never gets its upper bits back.
You would then need to:
1. not use unsigned for std::string positions (pos).
2. cast std::string::npos to signed.
int pos = 0; // signed by default
Example: (pos == (signed)std::string::npos).
But that's dumb, so it's much easier and correct to use the std::string::size_type when using std::strings and indexing positions.
Hope this helps.