Home > front end >  How do you count the frequency of which a word of n length occurs within a string
How do you count the frequency of which a word of n length occurs within a string

Time:03-28

I have this code here that correctly formats the hard-coded sentence and finds the frequency of which a certain letter shows up in that string:

#include <stdio.h>
#include <string.h>

int main() {
    char words[1000][100];
    int x = 0, y;

    char myString[10000] = "The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping";
    printf("Original Text:\n");
    printf("%s\n", myString);
   
    // Function for uppercase letters to become lowercase and to remove special characters
    for (x = 0; x <= strlen(myString);   x) {
        if (myString[x] >= 65 && myString[x] <= 90)
            myString[x] = myString[x]   32;
    }
    for (x = 0; myString[x] != '\0';   x) {
        while (!(myString[x] >= 'a' && myString[x] <= 'z') &&
               !(myString[x] >= 'A' && myString[x] <= 'Z') &&
               !(myString[x] >= '0' && myString[x] <= '9') &&
               !(myString[x] == '\0') && !(myString[x] == ' ')) {
            for (y = x; myString[y] != '\0';   y) {
                myString[y] = myString[y   1];
            }
            myString[y] = '\0';
        }
    }
   
    printf("\nModified Text: \n%s\n", myString);

    // Part A
    int counts[26] = { 0 };
    int k;
    size_t myString_length = strlen(myString);

    for (k = 0; k < myString_length; k  ) {
        char c = myString[k];
        if (!isalpha(c))
            continue;
        counts[(int)(c - 'a')]  ;
    }
   
    printf("\nLetter\tCount\n------  -----\n");
    
    for (k = 0; k < 26;   k) {
        printf("%c\t%d\n", k   'a', counts[k]);
    }

    // Part B
    int i = 0, count = 0, occurrences[10000] = { 0 };
 
    while (myString[i] != '\0') {
        char wordArray[100];
        int j = 0;
       
        while (myString[i] != ' ' && myString[i] != '\0') {
            wordArray[j  ] = myString[i  ];
        }
     
        if (wordArray[j - 1] == ',' || wordArray[j - 1] == '.') {
            wordArray[j - 1] = '\0';
        }

        wordArray[j] = '\0';

        int status = -1;
    
        for (j = 0; j < count;   j) {
            if (strcmp(words[j], wordArray) == 0) {
                status = j;
                break;
            }
        }
    
        if (status != -1) {
            occurrences[status]  = 1;
        } else {
            occurrences[count]  = 1;
            strcpy(words[count  ], wordArray);
        }
          i;
    }
 
    printf("\nWord Length\tOccurrences\n-----------     -----------\n");
 
    for (i = 0; i < count;   i) {
        // print each word and its occurrences
        printf("%s\t\t%d\n", words[i], occurrences[i]);
    }
}

Part B is where I'm having a problem though, I want the code to be able to tell me the occurrence of which a word of a specific length shows up, such as this instance:

Word length Occurrences
1           0
2           1

Here, there are no instances where there is a word with one character, but there is one instance where there is a word with two characters. However, my code is outputting the number of times a specific word is given and not what I want above, like this:

Word Length     Occurrences
-----------     -----------
the             3
quick           1
brown           1
                3
fox             1
jumps           1
over            1
lazy            2
dog             2
and             1
is              1
still           1
sleeping                1

How would I go about changing it so that it shows the output I want with just the word length and frequency?

CodePudding user response:

Here are some remarks about your code:

  • the first loop recomputes the length of the string for each iteration: for (x = 0; x <= strlen(myString); x). Since you modify the string inside the loop, it is difficult for the compiler to ascertain that the string length does not change, so a classic optimisation may not work. Use the same test as for the next loop:

      for (x = 0; myString[x] != '\0';   x)
    
  • the test for uppercase is not very readable because you hardcode the ASCII values of the letters A and Z, you should either write:

      if (myString[x] >= 'A' && myString[x] <= 'Z')
          myString[x]  = 'a' - 'A';
    

    or using macros from <ctype.h>:

      unsigned char c = myString[x];
      if (isupper(c))
          myString[x] = tolower(c);
    

    or equivalently and possibly more efficiently:

      myString[x] = tolower((unsigned char)myString[x]);
    
  • in the second loop, you remove characters that are neither letters, digits nor spaces. You have a redundant nested while loop and a third nexted loop to shift the rest of the array for each byte removed: this method has cubic time complexity, very inefficient. You should instead use a two finger method that operates in linear time:

      for (x = y = 0; myString[x] != '\0';   x) {
          unsigned char c = myString[x];
          if (!isalnum(c) && c != ' ') {
              myString[y  ] = c;
          }
      }
      myString[y] = '\0';
    
  • note that this loop removes all punctuation instead of replacing it with spaces: this might glue words together such as "a fine,good man" -> "a finegood man"

  • In the third loop, you use a char value c as an argument for isalpha(c). You should include <ctype.h> to use any function declared in this header file. Functions and macros from <ctype.h> are only defined for all values of the type unsigned char and the special negative value EOF. If type char is signed on your platform, isalpha(c) would have undefined behavior if the string has negative characters. In your particular case, you filtered characters that are not ASCII letters, digits or space, so this should not be a problem, yet it is a good habit to always use unsigned char for the character argument to isalpha() and equivalent functions.

  • Note also that this counting phase could have been combined into the previous loops.

  • to count the occurrences of words, the array occurrences should have the same number of elements as the words array, 1000. You do not check for boundaries so you have undefined behavior if there are more than 1000 different words and/or if any of these words has 100 characters or more.

  • in the next loop, you extract words from the string, incrementing i inside the nested loop body. You also increment i at the end of the outer loop, hence skipping the final null terminator. The test while (myString[i] != '\0') will test bytes beyond the end of the string, which is incorrect and potential undefined behavior.

  • to avoid counting empty words in this loop, you should skip sequences of spaces before copying the word if not at the end of the string.

Here is a modified version:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define WORD_COUNT 1000
#define WORD_LENGTH 100

int main() {
    char words[WORD_COUNT][WORD_LENGTH];
    int occurrences[WORD_COUNT] = { 0 };
    int counts['z' - 'a'   1] = { 0 };
    unsigned char c;
    int x, y;

    char myString[10000] = "The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping";
    printf("Original Text:\n");
    printf("%s\n", myString);
   
    // Function for uppercase letters to become lowercase and to remove special characters
    for (x = y = 0; myString[x] != '\0';   x) {
        c = myString[x];
        c = tolower(c);
        if (c >= 'a' && c <= 'z') {
            counts[c - 'a']  = 1;
        }
        if (isalnum(c) || c == ' ') {
            myString[y  ] = c;
        }
    }
    myString[y] = '\0';
   
    printf("\nModified Text: \n%s\n", myString);

    // Part A
    printf("\nLetter\tCount\n------  -----\n");
    for (c = 'a'; c <= 'z'; c  ) {
        printf("%c\t%d\n", c, counts[c - 'a']);
    }

    // Part B
    int i = 0, count = 0;
 
    for (i = 0;;) {
        char wordArray[WORD_LENGTH];
        int j;
       
        while (myString[i] == ' ') {
            i  ;
        }
        if (myString[i] == '\0') {
            break;
        }
        for (j = 0; i < sizeof(wordArray) - 1 && myString[i] != ' ' && myString[i] != '\0'; j  ) {
            wordArray[j] = myString[i  ];
        }
        wordArray[j] = '\0';

        int index = -1;
        for (j = 0; j < count;   j) {
            if (strcmp(words[j], wordArray) == 0) {
                index = j;
                break;
            }
        }
    
        if (index != -1) {
            occurrences[index]  = 1;
        } else
        if (count >= WORD_COUNT) {
            printf("too many different words, ignoring %s\n", wordArray);
        } else {
            index = count  ;
            strcpy(words[index], wordArray);
            occurrences[index]  = 1;
        }
    }
 
    printf("\nWord Length\tOccurrences\n-----------     -----------\n");
 
    for (i = 0; i < count;   i) {
        // print each word and its occurrences
        printf("%s\t\t%d\n", words[i], occurrences[i]);
    }
    return 0;
}

CodePudding user response:

Use strtok_r() and simplify counting.
It's sibling strtok() is not thread-safe. Discussed in detail in Why is strtok() Considered Unsafe?

Also, strtok_r() chops input string by inserting \0 chars inside the string. If you want to keep a copy of original string, you have to make a copy of original string and pass it on to strtok_r().

There is also another catch. strtok_r() is not a part of C-Standard yet, but POSIX-2008 lists it. GNU glibc implements it, but to access this function we need to #define _POSIX_C_SOURCE before any includes in our source files.

There is also strdup() & strndup() which duplicate an input string, they allocate memory for you. You've to free that string-memory when you're done using it. strndup() was added in POSIX-2008 so we declare 200809L in our sources to use it.

It's always better to use new standards to write fresh code. POSIX 200809L is recommended with at least C standard 2011.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAX_STR_LEN     1024
#define MAX_WORD_LEN    128
#define WORD_DELIMS     " \n\t"

int is_word (const char* str, const size_t slen) {
    int word = 0;
    for (size_t ci = 0; ci < slen;)
        if (isalnum (str[ci  ])) {
            word = 1;
            break;
        }
    return word;
}

void get_word_stat (const char* str, int word_stat[]) {
    char *copy = strndup (str, MAX_STR_LEN); // limiting copy
    if (!copy) { // copying failed
        printf ("Error duplicating input string\n");
        exit (1);
    }
    for (char *token, *rmdStr = copy; (token = strtok_r (NULL, WORD_DELIMS, &rmdStr)); /* empty */) {
        size_t token_len = strlen (token);
        if (token_len > (MAX_WORD_LEN - 1)) {
            printf ("Error: Increase MAX_WORD_LEN(%d) to handle words of length %lu\n", MAX_WORD_LEN, token_len);
            exit (2);
        }
        if (is_word (token, token_len))
              word_stat[token_len];
        else
            printf ("[%s] not a word\n", token);
    }
    free (copy);
}

int main () {
    char str [MAX_STR_LEN] = "The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping";
    printf ("Original Text: [%s]\n", str);

    int word_stat[MAX_WORD_LEN] = {0};
    get_word_stat (str, word_stat);

    printf ("\nWordLength   Occurrences\n");
    for (int si = 1; si < MAX_WORD_LEN;   si) {
        if (word_stat[si])
            printf ("%d\t\t%d\n", si, word_stat[si]);
    }
    return 0;
}
  • Related