Error in self written C function to split strings

Question

I just learned about string a few weeks ago, and I was interested in making my own function. At first I tried a short sentence and it worked, but I encountered a problem when I tried a long sentence...

The problem might be in the "getSplitText" function, how can I fix this error?

Is it bad to use for loop for this case?

Before

After

This is my code

#include <stdio.h>
#include <stdlib.h>

char *getSplitText(char *var_text, char split, int index);
int getLengthSplitText(char *var_text, char split);
int getLengthString(char *var_text);

int main(){
    char var_text[100];
    fgets(var_text,100,stdin);
    int i;
    printf("Total Characters: %d\n",getLengthString(var_text));
    printf("Total Words: %d\n",getLengthSplitText(var_text,' '));
    printf("Split Result:\n");
    for(i=1;i<=getLengthSplitText(var_text,' ');i++){
        printf("#%d-%s\n",i,getSplitText(var_text,' ',i));
    }

   return 0;
}

int getLengthString(char *var_text){
    int len = 0, i = 0;
    while (var_text[i] != '\0'){
        len++;
        i++;
    }
    return len;
}

int getLengthSplitText(char *var_text, char split){

    int textlen = getLengthString(var_text);
    int lenKata=0;
    char *resultKata = malloc(sizeof(char) * textlen);
    resultKata[0]='\0';
    int i,j=0;
    for(i=0;i<textlen;i++){

        if(var_text[i]!=split){

            if(textlen-1==i){
                resultKata[j] = var_text[i];
                resultKata[j + 1] = '\0';
                lenKata ++;
            }else {
                resultKata[j] = var_text[i];
                j++;
                resultKata[j + 1] = '\0';
            }
            //printf("#1 %s\n",resultKata);

        }else if(var_text[i]==split){

            if(resultKata[0]!='\0'){
                lenKata ++;
                }else{
                    j = 0;
                    resultKata[0] = '\0';
                }
            }
            //printf("#2 %s\n",resultKata);
        }
    return lenKata;
    }

char *getSplitText(char *var_text,char split, int index){

    int textlen = getLengthString(var_text);
    char *resultKata = malloc(sizeof(char) * textlen);
    resultKata[0]='\0';

    if(index<=getLengthSplitText(var_text,split)) {
        int i,j=0;
        int lenKata=0;

        for (i = 0; i < textlen; i++) {

            if (var_text[i] != split) {

                if (textlen - 1 == i) {
                    resultKata[j] = var_text[i];
                    resultKata[j + 1] = '\0';
                    lenKata++;
                    break;
                } else {
                    resultKata[j] = var_text[i];
                    j++;
                    resultKata[j + 1] = '\0';
                }
                //printf("#1 %s\n",resultKata);

            } else if (var_text[i] == split) {

                if (resultKata[0] != '\0') {
                    lenKata++;
                    if (lenKata == index) {
                        resultKata[j] = '\0';
                        break;
                    } else {
                        j = 0;
                        resultKata[0] = '\0';
                    }
                }
                //printf("#2 %s\n",resultKata);
            }
        }
    }
    return resultKata;
}

As a purely stylistic matter, you should use `break;` instead of `goto END_LOOP;` — Tom Karzes, Jul 17 '19 at 01:47
You should also never use `gets`. The man page for it explicitly says this. This may be the cause of your problem (or at least, allowing it to go undetected). Your string buffer holds 20 `char` values, which gives you 19 usable string values plus one terminating null character. If you enter any string larger than 19 characters, you will write past the end of your array and be subject to undefined behavior. — Tom Karzes, Jul 17 '19 at 01:49
The function: `gets()` has been depreciated for many years and completely removed in the last two versions of C. Suggest reading the MAN page for `fgets()` and using that function — user3629249, Jul 17 '19 at 01:50
regarding: `char *resultKata = malloc(sizeof(char) * textlen);` 1) the expression: `sizeof(char)` is defined in the C standard as 1. Multiplying anything by 1 has no effect and just clutters the code. Suggest removing that expression. 2) the function: `malloc()` is expecting a parameter of type: `size_t`, but `textlen` is declared as an `int` 3) always check (!=NULL) the returned value to assure the operation was successful. If not successful, call `perror( "malloc failed" );` to output both your error message and the text reason the system thinks the error occurred to `stderr` — user3629249, Jul 17 '19 at 01:54

Scheff's Cat · Answer 1 · 2019-07-17T09:49:46.097

There appear a lot of malloc()s in the code of OP.

Concerning the already given comments:

Multiplying anything by 1 has no effect and just clutters the code. Suggest removing that expression.

IMHO, this is a matter of style. If it improves or weakens the readability of code is a matter of personal preference. I assume the compiler is clever enough to ignore the multiplication by 1, and I would ignore as well.

the function: malloc() is expecting a parameter of type: size_t, but textlen is declared as an int

Although I agree I don't see it as a critical point as long as the provided int value will be positive.

always check (!=NULL) the returned value to assure the operation was successful.

This is what I would recommend as well.

What does me personally annoy is that there are some malloc()s in the code but not a single free(). Allocating memory with malloc() means, some heap memory is marked as “in use” in internal heap memory manager. If it is not free()d it will be marked “as used” until end of process life. Losing (e.g. overwriting) a pointer to an allocated memory makes it lost. (No possibility to address it again.) This is called memory leak.

Sure, this is not a critical issue in the exposed sample code but it might become one in larger applications with more heap memory consumption. Hence, I found it worth to be mentioned.

The whole example seems to resemble what can be done with the standard library functions

(Not that you get me wrong. I don't see anything bad to resemble standard functions for learning purposes.)

There is actually nothing wrong with getLengthString(). However, it manages two count variables i and len. The OP might have missed that they have always the same value after each iteration step. Hence, one of them is redundant and can be eliminated:

int getLengthString(char *var_text)
{
  int len = 0;
  while (var_text[len]) ++len;
  return len;
}

Concerning getLengthSplitText() (which is responsible to count words), I don't see why any malloc() is necessary at all. Thus, I wrote a new function getNumTokens() which counts the non-empty words (I call them “tokens”) which are separated by a certain split character:

int getNumTokens(char *var_text, char split)
{
  int n = 0;
  for (int inToken = 0; *var_text; ++var_text) {
    if (!inToken && *var_text != split) ++n; // count if new token starts
    inToken = *var_text != split; // update flag
  }
  return n;
}

It's a bit shorter and doesn't need any malloc().

Please, note the variable inToken which is actually used as (boolean) flag. It is responsible to remember whether a token is already counted while iterating over text. It is updated in each iteration step by assigning the result of comparison of current character (*var_text) and the separator (split).

The provided var_text is used for access and progress directly – no extra index is used. As there is only one iteration over text needed, it doesn't hurt that the pointer is changed. It's a local copy (passed by value) with limited life-time inside of function.

Concerning strtok(), I once wrote a variant strtoke() (just for fun) as answer to SO: Split string into Tokens in C, when there are 2 delimiters in a row.

There are two facts to note:

strtok() modifies the input string by replacing an occurrence of a delimiter by a \0 byte (to delimit the found token).
strtok() manages an internal global state which makes it non-reentrant.

The solution of OP seems to have none of those restrictions buying that with the usage of malloc() (with all the concerns mentioned above).

So, I changed the assumptions a bit (there are actually no restrictions mentioned in OPs question). I considered modification of input (1.) as acceptable and wrote a new function getNextToken():

char* getNextToken(char *var_text, char split)
{
  // skip space
  for (; *var_text && *var_text == split; ++var_text);
  // remember start of token
  char *token = var_text;
  // skip token
  for (; *var_text && *var_text != split; ++var_text);
  // remark end of token
  *var_text = '\0'; // doesn't hurt if there is already a '\0'
  // done
  return token;
}

The function will return the start of first token in var_text. If it is an empty string (the contents of returned pointer is \0) then no token could be found. Thereby, the end of found token is remarked with a \0-Byte (in the given input string) which might be a split character before.

As I didn't want a global internal state, I have to move from token to token outside of getNextToken(). I found this acceptable as it can be done using the function getLengthString() again. With this, a pointer can be moved from begin to end of token (where the \0-Byte has been written). Adding 1, the possible start of next token is reached. Of course, this might break when the end of text is reached (where the address after the terminator \0 might be out of bound). For our luck, the number of tokens is already known.

The complete sample:

#include <stdio.h>

int getLengthString(char *var_text);
int getNumTokens(char *var_text, char split);
char* getNextToken(char *var_text, char split);

int main()
{
  char var_text[100];
  if (!fgets(var_text, 100, stdin)) {
    fprintf(stderr, "Input failed!\n");
    return -1;
  }
  printf("Total Characters: %d\n", (int)getLengthString(var_text));
  const char split = ' ';
  const int nTokens = getNumTokens(var_text, split);
  printf("Total Words: %d\n", nTokens);
  printf("Split Result:\n");
  char *token = var_text;
  for (int i = 1; i <= nTokens; ++i) {
    token = getNextToken(token, split);
    printf("#%2d-%s\n", i, token);
    token += getLengthString(token) + 1;
  }
  return 0;
}

int getLengthString(char *var_text)
{
  int len = 0;
  while (var_text[len]) ++len;
  return len;
}

int getNumTokens(char *var_text, char split)
{
  int n = 0;
  for (int inToken = 0; *var_text; ++var_text) {
    if (!inToken && *var_text != split) ++n; // count if new token starts
    inToken = *var_text != split; // update flag
  }
  return n;
}

char* getNextToken(char *var_text, char split)
{
  // skip space
  for (; *var_text && *var_text == split; ++var_text);
  // remember start of token
  char *token = var_text;
  // skip token
  for (; *var_text && *var_text != split; ++var_text);
  // remark end of token
  *var_text = '\0'; // doesn't hurt if there is already a '\0'
  // done
  return token;
}

Output for I like C programming language but I have problem about this code.:

Total Characters: 66
Total Words: 12
Split Result:
# 1-I
# 2-like
# 3-C
# 4-programming
# 5-language
# 6-but
# 7-I
# 8-have
# 9-problem
#10-about
#11-this
#12-code.

Output for Thank you for the help. Now, the code can work. I like programming.:

Total Characters: 68
Total Words: 13
Split Result:
# 1-Thank
# 2-you
# 3-for
# 4-the
# 5-help.
# 6-Now,
# 7-the
# 8-code
# 9-can
#10-work.
#11-I
#12-like
#13-programming.

Live Demo on coliru

@auprojectsall Don't worry too much. Guess how all the professionals started... ;-) — Scheff's Cat, Jul 21 '19 at 07:51

Error in self written C function to split strings

1 Answers1