23

I've been looking around a fair bit for an answer. I'm going to make a series of my own string functions like my_strcmp(), my_strcat(), etc.

Does strcmp() work through each index of the two character arrays and, if the ASCII value at the same index is smaller in one string, treat that string as alphabetically greater and therefore return 0, 1 or 2? I guess what I'm asking is: does it use the ASCII values of the characters to produce these results?

Any help would be greatly appreciated.

[REVISED]

OK, so I have come up with this... it works for all cases except when the second string is greater than the first.

Any tips?

int my_strcmp(char s1[], char s2[])
{   
    int i = 0;
    while ( s1[i] != '\0' )
    {
        if( s2[i] == '\0' ) { return 1; }
        else if( s1[i] < s2[i] ) { return -1; }
        else if( s1[i] > s2[i] ) { return 1; }
        i++;
    }   
    return 0;
}


int main (int argc, char *argv[])
{
    int result = my_strcmp(argv[1], argv[2]);

    printf("Value: %d \n", result);

    return 0;

}
Alistair Gillespie
  • 2
    Why don't you just look at an implementation (glibc's, or any other - search for "strcmp source code")? (And for the return value and specification, read the man page or POSIX.) – Mat Aug 27 '12 at 04:44
  • @DizzyChamp `strcmp` uses a [*lexicographical order*](http://en.wikipedia.org/wiki/Lexicographical_order) – obataku Aug 27 '12 at 04:46

9 Answers

33

The pseudo-code "implementation" of strcmp would go something like:

define strcmp (s1, s2):
    p1 = address of first character of s1
    p2 = address of first character of s2

    while contents of p1 not equal to null:
        if contents of p2 equal to null: 
            return 1

        if contents of p2 greater than contents of p1:
            return -1

        if contents of p1 greater than contents of p2:
            return 1

        advance p1
        advance p2

    if contents of p2 not equal to null:
        return -1

    return 0

That's basically it. Each character is compared in turn and a decision is made as to whether the first or second string is greater, based on that character.

Only if the characters are identical do you move to the next character and, if all the characters were identical, zero is returned.

Note that you may not necessarily get 1 and -1, the specs say that any positive or negative value will suffice, so you should always check the return value with < 0, > 0 or == 0.

Turning that into real C would be relatively simple:

int myStrCmp (const char *s1, const char *s2) {
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    while (*p1 != '\0') {
        if (*p2 == '\0') return  1;
        if (*p2 > *p1)   return -1;
        if (*p1 > *p2)   return  1;

        p1++;
        p2++;
    }

    if (*p2 != '\0') return -1;

    return 0;
}
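
Because only the sign of the result is specified, callers are safest testing it against zero. A minimal sketch of that usage, where `describe_order` is a hypothetical helper name used only for illustration:

```c
#include <string.h>

/* Hypothetical helper: map a strcmp-style result onto a word.
   Only the sign of the result is inspected, never its magnitude. */
const char *describe_order(const char *a, const char *b)
{
    int r = strcmp(a, b);

    if (r < 0) return "before";   /* a sorts before b */
    if (r > 0) return "after";    /* a sorts after b  */
    return "equal";               /* r == 0           */
}
```

A test such as `strcmp(a, b) == 1` may fail on an implementation that returns the byte difference rather than 1.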

Also keep in mind that "greater" in the context of characters is not necessarily based on simple ASCII ordering for all string functions.

C has a concept called 'locales' which specify (among other things) collation, or ordering of the underlying character set and you may find, for example, that the characters a, á, à and ä are all considered identical. This will happen for functions like strcoll.
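
As a rough sketch of the locale-aware side, the function below switches the collation locale and then calls strcoll. The names `sign_of` and `collate_in_locale` are hypothetical, and which locale names exist (other than "C") depends on the system, so results for accented characters are not shown here.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: reduce any comparison result to -1, 0 or 1. */
static int sign_of(int r)
{
    return (r > 0) - (r < 0);
}

/* Compare two strings using the collation rules of the named locale.
   Returns -1, 0 or 1, or 2 if the locale is not available.
   Note: setlocale changes process-wide state. */
int collate_in_locale(const char *locale_name, const char *a, const char *b)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 2;                 /* locale not installed on this system */
    return sign_of(strcoll(a, b));
}
```

In the "C" locale strcoll orders strings the same way strcmp does; differences only appear once another collation locale is selected.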

paxdiablo
  • `strcmp` has nothing to do with locale. It compares byte values as `unsigned char`. – R.. GitHub STOP HELPING ICE Aug 27 '12 at 05:05
  • Sorry, @R, I didn't mean specifically `strcmp`, rather I was talking about the string functions as a whole. I'll clarify. – paxdiablo Aug 27 '12 at 05:12
  • Most of the functions whose names begin with `str` have nothing to do with locale. – R.. GitHub STOP HELPING ICE Aug 27 '12 at 05:13
  • 1
    Except strcoll and strxfrm, which are locale-sensitive. strcoll is basically locale-aware strcmp, and strxfrm transforms strings so that strcmp will perform locale-aware comparisons. Of course, you're probably still better off using a real internationalization library if possible. – nneonneo Aug 27 '12 at 05:23
  • 1
    Err, I'm not contending that all (or even most, for that matter) strXXX functions use locale. I'm contending that not all of them ignore it. As I point out, strcoll is one that uses it and nneonneo also mentions strxfrm. I mentioned that because OP stated "I'm going to make a _series_ of my own string functions". That series may or may not consist of all the ISO C strXXX functions, I just thought it safer to cover all possibilities. – paxdiablo Aug 27 '12 at 05:30
  • If there's a way you think it could be better worded in the answer, please let me know. I'm happy to change it. – paxdiablo Aug 27 '12 at 05:30
  • @paxdiablo - where does it say the comparison is done as unsigned char? – technosaurus May 12 '14 at 04:12
  • 3
    nevermind - "The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared." ... good to know – technosaurus May 12 '14 at 04:28
  • 1
    Note: `if (*p2 == '\0') return 1;` is not needed. – chux - Reinstate Monica Mar 15 '15 at 21:07
10

Here is the BSD implementation:

int
strcmp(s1, s2)
    register const char *s1, *s2;
{
    while (*s1 == *s2++)
        if (*s1++ == 0)
            return (0);
    return (*(const unsigned char *)s1 - *(const unsigned char *)(s2 - 1));
}

Once there is a mismatch between two characters, it just returns the difference between those two characters.
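
Returning the raw difference is fine for callers that only look at the sign. One such caller is qsort, sketched below with a comparator that passes strcmp's result straight through. `compare_strings` and `sort_strings` are hypothetical names for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of `const char *`.
   qsort only looks at the sign of the result, so passing strcmp's
   return value straight through is fine whatever its magnitude. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort `count` strings in ascending byte order. */
void sort_strings(const char **strings, size_t count)
{
    qsort(strings, count, sizeof strings[0], compare_strings);
}
```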

chrisaycock
  • 6
    It's worth noting that it doesn't *always* return the difference between the two differing characters; it is actually permitted to return any integer provided the sign is the same as the difference between the bytes. A misunderstanding of this implementation detail (in the related memcmp function) resulted in a real-world [security vulnerability in MySQL](http://seclists.org/oss-sec/2012/q2/493). – nneonneo Aug 27 '12 at 04:52
  • @nneonneo That's an excellent point. Implementing it via "difference between characters" is good, but relying on that as a user is bad. – chrisaycock Aug 27 '12 at 04:55
  • 4
    As always, RTFM before using a function. Even the C standard is blatantly clear over how strcmp() behaves: `The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.` That's it, no further guarantees. – Lundin Aug 27 '12 at 06:44
9

It uses the byte values of the characters, returning a negative value if the first string appears before the second (ordered by byte values), zero if they are equal, and a positive value if the first appears after the second. Since it operates on bytes, it is not encoding-aware.

For example:

strcmp("abc", "def") < 0
strcmp("abc", "abcd") < 0 // null character is less than 'd'
strcmp("abc", "ABC") > 0 // 'a' > 'A' in ASCII
strcmp("abc", "abc") == 0

More precisely, as described in the strcmp Open Group specification:

The sign of a non-zero return value shall be determined by the sign of the difference between the values of the first pair of bytes (both interpreted as type unsigned char) that differ in the strings being compared.

Note that the return value may not be equal to this difference, but it will carry the same sign.
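
To make the quoted rule concrete, here is a small sketch that computes the difference itself and checks that strcmp agrees with it in sign. `first_byte_difference` and `same_sign_as_strcmp` are hypothetical names, and the code assumes `int` is wider than `char`, as on typical hosted platforms.

```c
#include <string.h>

/* Difference between the first pair of differing bytes, both read as
   unsigned char. For equal strings the result is 0.
   Assumes int is wider than char. */
int first_byte_difference(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    while (*p1 != '\0' && *p1 == *p2) {
        p1++;
        p2++;
    }
    return (int)*p1 - (int)*p2;
}

/* Nonzero when strcmp agrees in sign with the difference above. */
int same_sign_as_strcmp(const char *s1, const char *s2)
{
    int d = first_byte_difference(s1, s2);
    int r = strcmp(s1, s2);
    return ((d > 0) - (d < 0)) == ((r > 0) - (r < 0));
}
```

Reading the bytes as unsigned char is what makes a byte such as `"\xff"` compare greater than `"\x01"`, even on platforms where plain `char` is signed.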

nneonneo
  • Best answer here, actually no one spoke about this __null character is less than__ any other character – AboAmmar Dec 10 '17 at 17:32
4

Here is my version, written for small microcontroller applications, MISRA-C compliant. The main aim with this code was to write readable code, instead of the one-line goo found in most compiler libs.

int8_t strcmp (const uint8_t* s1, const uint8_t* s2)
{
  while ( (*s1 != '\0') && (*s1 == *s2) )
  {
    s1++; 
    s2++;
  }

  return (int8_t)( (int16_t)*s1 - (int16_t)*s2 );
}

Note: the code assumes 16 bit int type.

Lundin
  • 1
    Note: The `&& (*s2 != '\0')` is not needed. – chux - Reinstate Monica Mar 15 '15 at 20:48
  • This code is incorrect: `(*s1 != *s2)` should be `(*s1 == *s2)`. Why should `strcmp` return `int8_t`? it is defined to return an `int`! The final `return` has 3 useless casts: `return (*s1 > *s2) - (*s1 < *s2);` makes no assumptions about the size of `int` or that of `char`. – chqrlie Jan 02 '17 at 01:59
  • 1
    The return value `(int8_t)( (int16_t)*s1 - (int16_t)*s2 );` is actually incorrect if the strings can contain non-ASCII characters as the difference of 2 characters can exceed the range of `int8_t`. You would get implementation defined behavior and might get an incorrect sign. – chqrlie Jan 02 '17 at 02:09
  • @chqrlie Indeed, no idea where the != came from. The original code where I took the snippet from didn't have that bug. Fixed. – Lundin Jan 02 '17 at 07:14
  • @chqrlie As for return type `int8_t`, it is an optimization for microcontroller systems, where `int` would be ineffective to use. The casts are there to ensure that each operand is of the intended type. When writing MISRA-compliant code, you aren't allowed to write code that depends on implicit type promotions, which is a sound rule. Similarly, you aren't allowed to write code that relies on implicit type conversions to a smaller type, hence the final cast. – Lundin Jan 02 '17 at 07:19
  • @Lundin: from my own experience auditing this kind of code, these constraints do not help programmers produce better or safer code. In this particular case, the sign of the result is incorrect for simple cases. eg: `strcmp("\001", "\377");` returns `2` where the result should be negative and would be if the return type was `int`. – chqrlie Jan 02 '17 at 08:20
  • @chqrlie Your test case is flawed even for standard C strcmp with the signature `int strcmp(const char *s1, const char *s2);`. There are no guarantees for what will happen if you attempt to store the value 377o inside a `char` variable, so your test case invokes undefined behavior. That being said, this code obviously assumes 7 bit ASCII. This was written for embedded system environments where extended symbol tables aren't used. Or if they were, `wcscmp` would be used, not `strcmp`. – Lundin Jan 02 '17 at 08:44
  • @Lundin: `"\377"` is a perfectly valid string if `char` has at least eight bits. It defines the same string as `"\xFF"` and if the `char` type is signed, this character is negative, as is the character literal `'\377'`. *Extended symbol tables* are a completely unrelated topic and I agree you definitely do not want to use the wide character APIs in an embedded system. – chqrlie Jan 02 '17 at 12:09
  • @chqrlie No. If `char` has 8 bits but is signed, the literal `\377` is too large to fit inside it. The literal itself is of type `int`, so your code is equivalent to writing something like for example `char ch = 255;` Such code is not safe, this invokes poorly specified behavior as per C11 6.3.1.3 "Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.". This is a bug in your test case and not in the posted answer. – Lundin Jan 02 '17 at 12:27
  • 2
    @Lundin: I'm afraid you are mistaken: if the `char` type is 8-bit signed, the character literal `'\377'` indeed has type `int` but its value is `-1` (cf C11 6.4.4.4 Character constants pp9 to 13 with an equivalent example `'\xFF'`). Regarding string literals, C11 6.4.5 p4 clearly states that *The same considerations apply to each element of the sequence in a string literal as if it were in an integer character constant...*. Hence, on the same platform, the string literal, `"\377"` is parsed as having a single character with value `-1` as well. Is there another issue in the test case? – chqrlie Jan 02 '17 at 18:05
4

This, from the masters themselves (K&R, 2nd ed., pg. 106):

// strcmp: return < 0 if s < t, 0 if s == t, > 0 if s > t
int strcmp(char *s, char *t) 
{
    int i;

    for (i = 0; s[i] == t[i]; i++)
        if (s[i] == '\0')
            return 0;
    return s[i] - t[i];
}
mihai
2

This code is equivalent, shorter, and more readable:

int8_t strcmp (const uint8_t* s1, const uint8_t* s2)
{
    while( (*s1!='\0') && (*s1==*s2) ){
        s1++; 
        s2++;
    }

    return (int8_t)*s1 - (int8_t)*s2;
}

We only need to test for end of s1, because if we reach the end of s2 before end of s1, the loop will terminate (since *s2 != *s1).

The return expression calculates the correct value in every case, provided we are only using 7-bit (pure ASCII) characters. Careful thought is needed to produce correct code for 8-bit characters, because of the risk of integer overflow.
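
One way to sidestep the overflow concern while keeping the narrow return type is to build the result from two comparisons, as suggested in the comments on the earlier MISRA-style answer. This is only a sketch; `strcmp8` is a hypothetical name chosen to avoid clashing with the library function.

```c
#include <stdint.h>

/* Hypothetical variant keeping the int8_t/uint8_t signature used above.
   The result is built from two comparisons, so it is always -1, 0 or 1
   and cannot overflow int8_t even for bytes above 127. */
int8_t strcmp8(const uint8_t *s1, const uint8_t *s2)
{
    while ((*s1 != 0u) && (*s1 == *s2)) {
        s1++;
        s2++;
    }
    return (int8_t)((*s1 > *s2) - (*s1 < *s2));
}
```

With this form, comparing a string holding the byte 0x01 against one holding 0xFF yields a negative result, which the subtraction-based return cannot guarantee once it is narrowed to `int8_t`.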

david k
  • Why post a solution that is correct for "only using 7-bit (pure ASCII) characters"? Better to post a solution that works per the C spec. – chux - Reinstate Monica Mar 15 '15 at 20:55
  • The cast to `int8_t` on the last line is superfluous, because each operand will get implicitly promoted to `int` anyhow. – Lundin Mar 16 '15 at 07:58
  • The return expression would be OK for 8-bit characters if the return type was `int`. The casts are useless. There is no risk on integer overflow. Indeed you could not use this expression if type `char` had the same size as type `int`, which can occur in some DSP architectures. But since you define the return value to be non-standard `int8_t`, you cannot use this expression if 2 characters can be more than `127` apart. – chqrlie Jan 02 '17 at 02:05
1

I found this on the web.

http://www.opensource.apple.com/source/Libc/Libc-262/ppc/gen/strcmp.c

int strcmp(const char *s1, const char *s2)
{
    for ( ; *s1 == *s2; s1++, s2++)
        if (*s1 == '\0')
            return 0;
    return ((*(unsigned char *)s1 < *(unsigned char *)s2) ? -1 : +1);
}
chqrlie
Matthew
-1

This is how I implemented my strcmp. It compares the first letter of the two strings; if they are identical, it moves on to the next letter. If not, it returns the corresponding value. It is very simple and easy to understand:

#include

//function declaration:
int strcmp(char string1[], char string2[]);

int main()
{
    char string1[]=" The San Antonio spurs";
    char string2[]=" will be champins again!";
    //calling the function- strcmp
    printf("\n number returned by the strcmp function: %d", strcmp(string1, string2));
    getch();
    return(0);
}

/**This function calculates the dictionary value of the string and compares it to another string.
it returns a number bigger than 0 if the first string is bigger than the second
it returns a number smaller than 0 if the second string is bigger than the first
input: string1, string2
output: value- can be 1, 0 or -1 according to the case*/
int strcmp(char string1[], char string2[])
{
    int i=0;
    int value=2;    //this initial value can be any number other than those the function can return
    while(value==2)
    {
        if (string1[i]>string2[i])
        {
            value=1;
        }
        else if (string1[i]<string2[i])
        {
            value=-1;
        }
        else
        {
            i++;
        }
    }
    return(value);
}
Indrajeet
eshel
  • 1
    This answer in incorrect. Per the C spec, the compares `string1[i]>string2[i]`, `string1[i] – chux - Reinstate Monica Mar 15 '15 at 20:58
  • 2
    This code is incorrect because when the strings are equal, it hangs the program, and while doing so reads past the null termination and therefore likely out-of-bounds, causing undefined behavior. – Lundin Mar 16 '15 at 09:33
-2

It is just this:

int strcmp(char *str1, char *str2){
    while( (*str1 == *str2) && (*str1 != 0) ){
        ++*str1;
        ++*str2;
    }
    return (*str1-*str2);
}

If you want it to be faster, you can add "register" before the type, like this: register char

Then it looks like this:

int strcmp(register char *str1, register char *str2){
    while( (*str1 == *str2) && (*str1 != 0) ){
        ++*str1;
        ++*str2;
    }
    return (*str1-*str2);
}

This way, CPU registers are used where possible.
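
For comparison, here is a sketch of the same loop that advances the pointers (`++str1`) rather than incrementing the characters they point to (`++*str1`), and that compares the final pair as `unsigned char`. `strcmp_advancing` is a hypothetical name, and the code assumes `int` is wider than `char`.

```c
/* Same loop shape as above, but the pointers move forward and the
   strings themselves are never modified, so const pointers suffice. */
int strcmp_advancing(const char *str1, const char *str2)
{
    while ((*str1 == *str2) && (*str1 != '\0')) {
        ++str1;
        ++str2;
    }
    /* Both operands are promoted to int before the subtraction. */
    return (unsigned char)*str1 - (unsigned char)*str2;
}
```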

  • your implementation is incorrect, and so is the OP's: the return value should reflect the result of the comparison of `unsigned char` values, not `char`. More importantly, `++*str1;` and `++*str2;` should be simply `++str1;` and `++str2;`. – chqrlie Jan 02 '17 at 01:52