Splitting a string and counting tokens in C

Time:10-22

I have a text file that contains multiple strings of different lengths that I need to split into tokens. Is strtok the best way to split these strings, and how can I count the tokens?

Example of strings from the file

Emma Stone#1169876#COMP242#COMP333#COMP336#COMP133#COMP231
Emma Watson#1169875#COMP336#COMP2421#COMP231#COMP338#CCOMP3351
Kevin Hart#1146542#COMP142#COMP242#COMP231#COMP336#COMP331#COMP334
George Clooney#1164561#COMP336#COMP2421#COMP231#COMP338#CCOMP3351
Matt Damon#1118764#COMP439#COMP4232#COMP422#COMP311#COMP338
Johnny Depp#1019876#COMP311#COMP242#COMP233#COMP3431#COMP333#COMP432

CodePudding user response:

Generally, using strtok is a good solution to the problem:

#include <stdio.h>
#include <string.h>

int main( void )
{
    char line[] =
        "Emma Stone#1169876#COMP242#COMP333#COMP336#COMP133#COMP231";

    char *p;
    int num_tokens = 0;

    p = strtok( line, "#" );

    while ( p != NULL )
    {
        num_tokens++;

        printf( "Token #%d: %s\n", num_tokens, p );

        p = strtok( NULL, "#" );
    }
}

This program has the following output:

Token #1: Emma Stone
Token #2: 1169876
Token #3: COMP242
Token #4: COMP333
Token #5: COMP336
Token #6: COMP133
Token #7: COMP231

However, one disadvantage of using strtok is that it is destructive in the sense that it modifies the string, by replacing the # delimiters with terminating null characters. If you do not want this, then you can use strchr instead:

#include <stdio.h>
#include <string.h>

int main( void )
{
    const char *const line =
        "Emma Stone#1169876#COMP242#COMP333#COMP336#COMP133#COMP231";

    const char *p = line, *q;
    int num_tokens = 1;

    while ( ( q = strchr( p, '#' ) ) != NULL )
    {
        /* the precision argument of %.*s must be an int, so cast the pointer difference */
        printf( "Token #%d: %.*s\n", num_tokens, (int)( q - p ), p );
        num_tokens++;
        p = q + 1;
    }

    printf( "Token #%d: %s\n", num_tokens, p );
}

This program has identical output to the first program:

Token #1: Emma Stone
Token #2: 1169876
Token #3: COMP242
Token #4: COMP333
Token #5: COMP336
Token #6: COMP133
Token #7: COMP231
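
Note that the two approaches only agree as long as the line contains no empty fields. strtok treats a run of consecutive delimiters as a single delimiter and therefore skips empty tokens, whereas the strchr loop reports an empty token between two adjacent # characters. The following is a minimal sketch of that difference; the function names count_with_strtok and count_with_strchr are made up for this illustration:

```c
#include <string.h>

/* Counts tokens the way the strtok loop does: runs of delimiters
   collapse, so empty fields are skipped. Modifies s. */
int count_with_strtok(char *s, const char *delims)
{
    int n = 0;

    for (char *p = strtok(s, delims); p != NULL; p = strtok(NULL, delims))
        n++;

    return n;
}

/* Counts fields the way the strchr loop does: every delimiter ends
   a field, so empty fields are counted too. Does not modify s. */
int count_with_strchr(const char *s, char delim)
{
    int n = 1;

    for (const char *q = strchr(s, delim); q != NULL; q = strchr(q + 1, delim))
        n++;

    return n;
}
```

For a line such as "a##b", the first function should return 2 and the second 3. Which of the two behaviors is correct depends on whether an empty field is meaningful in your file.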

Another disadvantage of strtok is that it is neither reentrant nor thread-safe, whereas strchr is. However, some platforms (POSIX systems, for example) provide the function strtok_r, which does not have these disadvantages. That function still has the disadvantage of being destructive, though.
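
To illustrate why reentrancy matters, here is a small sketch that nests two tokenizing loops: the outer one splits on # and the inner one splits each field on spaces. This does not work with plain strtok, because the inner loop would overwrite its hidden state. The sketch assumes a POSIX system where strtok_r is available, and the function name count_words_in_fields is made up for this example:

```c
#define _POSIX_C_SOURCE 200809L /* assumption: POSIX system, needed for strtok_r */
#include <string.h>

/* Counts the space-separated words in all '#'-separated fields of line.
   Still destructive: both levels write null characters into line. */
int count_words_in_fields(char *line)
{
    char *outer_state = NULL; /* saved position of the '#' loop */
    char *inner_state = NULL; /* saved position of the ' ' loop */
    int words = 0;

    for (char *field = strtok_r(line, "#", &outer_state);
         field != NULL;
         field = strtok_r(NULL, "#", &outer_state))
    {
        for (char *word = strtok_r(field, " ", &inner_state);
             word != NULL;
             word = strtok_r(NULL, " ", &inner_state))
        {
            words++;
        }
    }

    return words;
}
```

Because each loop keeps its position in its own state variable, the two loops do not interfere with each other.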

CodePudding user response:

Yes, you should use strtok to split these strings.

Regarding "how can I count the tokens":

You can simply add a counter inside the while loop and increment it by one in each iteration to get the total number of tokens.

#include <stdio.h>
#include <string.h>

int main(void) {

  char string[] = "Hello world this is a simple string";
  char *token = strtok(string, " ");
  int count = 0;

  while (token != NULL) {
    count++;
    token = strtok(NULL, " ");
  }
  printf("Total number of tokens = %d", count);

  return 0;
}
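
Since the strings in the question come from a text file, the same counter can be applied to one line at a time. The following is only a sketch: the function name count_tokens_in_stream and the 256-byte line buffer are assumptions, and a longer line would be read in several pieces and counted incorrectly.

```c
#include <stdio.h>
#include <string.h>

/* Reads fp line by line, prints the token count of each line and
   returns the total number of tokens in the stream. */
long count_tokens_in_stream(FILE *fp, const char *delims)
{
    char line[256]; /* assumed maximum line length */
    long total = 0;

    while (fgets(line, sizeof line, fp) != NULL)
    {
        /* cut the line at the first newline or carriage return, if any */
        line[strcspn(line, "\r\n")] = '\0';

        int count = 0;
        for (char *tok = strtok(line, delims); tok != NULL; tok = strtok(NULL, delims))
            count++;

        printf("%d tokens\n", count);
        total += count;
    }

    return total;
}
```

You would call it with a stream obtained from fopen, for example count_tokens_in_stream(fp, "#") after checking that fopen did not return NULL.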

CodePudding user response:

You can also write your own function to handle this quite trivial split:

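#include <stdio.h>

/* Splits str in place at every delim and stores a pointer to each
   token in argv. The caller must make argv large enough. */
char **split(char *str, char **argv, size_t *argc, const char delim)
{
    *argc = 0;
    if(str && *str)
    {
        argv[0] = str;
        *argc = 1;
        while(*str)
        {
            if(*str == delim)
            {
                *str = 0;           /* terminate the current token */
                str++;
                if(*str)
                {
                    argv[*argc] = str;
                    *argc += 1;
                }
                continue;           /* str already points past the delimiter */
            }
            str++;
        }
    }
    return argv;
}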


int main(void)
{
    char *argv[10];
    size_t argc;
    char str[] = "Emma Stone#1169876#COMP242#COMP333#COMP336#COMP133#COMP231";

    split(str, argv, &argc, '#');

    printf("Number of substrings: %zu\n", argc);
    for(size_t i = 0; i < argc; i++)
        printf("token [%2zu] = `%s`\n", i, argv[i]);
}

https://godbolt.org/z/b1aarnfWs

Remark: like strtok, this function requires str to be modifiable, because str will be modified.
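
One more thing to watch: the function does not know how large argv is, so a line with more tokens than array elements would write past the end of the array. A possible variant that takes the capacity as a parameter could look like the sketch below. The name split_bounded and the choice to leave the unsplit remainder in the last token are assumptions made for this example:

```c
#include <stddef.h>

/* Like split, but never stores more than max_tokens pointers in argv.
   Returns the number of tokens stored. Modifies str. */
size_t split_bounded(char *str, char **argv, size_t max_tokens, char delim)
{
    size_t argc = 0;

    if (str == NULL || *str == '\0' || max_tokens == 0)
        return 0;

    argv[argc++] = str;         /* the first token starts at the beginning */

    while (*str != '\0')
    {
        if (*str != delim)
        {
            str++;
            continue;
        }
        if (argc == max_tokens)
            break;              /* array full: the rest stays in the last token */
        *str++ = '\0';          /* terminate the current token */
        if (*str != '\0')
            argv[argc++] = str; /* the next token starts after the delimiter */
    }

    return argc;
}
```

With char *argv[10], you would call it as split_bounded(str, argv, 10, '#') and use the return value as the token count.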
