Remove white chars between commas, but not between what inside the commas

Time:05-10

I'm new to C and learning C90. I'm trying to parse a string into a command, but I'm having a hard time removing white chars.

My goal is to parse a string like this:

NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   

into this:

NA ME,NAME,123 456,124,14134,134. 134,1

so the white chars that were inside the arguments are still there, but the other white chars are removed.

I thought about using strtok, but I still want to keep the commas, even if there are multiple consecutive commas.
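
A minimal sketch of the strtok behaviour in question: strtok treats a run of consecutive delimiters as a single separator, so empty fields between commas disappear. The helper name count_tokens and the 128-byte buffer are assumptions made only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok finds in a comma-separated
 * text. strtok modifies its argument, so we work on a local copy. */
int count_tokens(const char *text)
{
    char buf[128];
    char *tok;
    int n = 0;

    assert(strlen(text) < sizeof buf);   /* the sketch only handles short inputs */
    strcpy(buf, text);

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ","))
        n++;
    return n;
}
```

For the input "a,,b" this returns 2 rather than 3, because the empty field between the two commas is skipped.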

So far I have used:

void removeWhiteChars(char *s)
{
    int i = 0;
    int count = 0;
    int inNum = 0;
    while (s[i])
    {
        if (isdigit(s[i]))
        {
            inNum = 1;
        }
        if (s[i] == ',')
        {
            inNum = 0;
        }
        if (!isspace(s[i]) && !inNum)
            s[count++] = s[i];
        else if (inNum)
        {
            s[count++] = s[i];
        }

        ++i;
    }
    s[count] = '\0'; /* adding NULL-terminate to the string */
}

But it only preserves white chars once it has seen a digit, and it does not remove the white chars between a number and the following comma, so it's quite wrong.

I would appreciate any kind of help. I've been stuck on this one for two days now.

CodePudding user response:

You need to look ahead whenever you encounter whitespace that might be skippable. Every time the function below sees a space, it looks ahead to check whether the run of spaces ends at a comma or at the end of the string, and removes the run if so. Likewise, for every comma, it removes all spaces that follow it.

/* Remove the len characters str[index] .. str[index + len - 1] in place */
void splice (char * str, int index, int len) {
  while (str[index + len]) {
    str[index] = str[index + len];
    index++;
  }
  str[index] = 0;
}

void removeWhiteChars (char * str) {
  int index=0, seq_len;

  while (str[index]) {
    if (str[index] == ' ') {
      seq_len = 0;

      while (str[index + seq_len] == ' ') seq_len++;

      /* drop the run if it is followed by a comma or ends the string */
      if (str[index + seq_len] == ',' || str[index + seq_len] == '\0') {
        splice(str, index, seq_len);
        if (str[index] == '\0') break;
      }
    }
    if (str[index] == ',') {
      seq_len = 0;
      while (str[index + seq_len + 1] == ' ') seq_len++;

      if (seq_len) {
        splice(str, index + 1, seq_len);
      }
    }
    index++;
  }
}

CodePudding user response:

The code below works, at least for your input string. I make absolutely no claims as to its efficiency or elegance. I did not try to modify s in place, but instead wrote to a new string. The algorithm I followed was:

  • Initialize startPos to 0.
  • Loop on s until you find a comma.
  • Back up from that position until you find the first non-space character.
  • memcpy from startPos to that position into a new string.
  • Add a comma at the next position of the new string.
  • Look forward from the comma position until you find the first non-space character, and set startPos to that position.
  • Rinse and repeat.
  • At the very end, append the final token with strcat.

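#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void removeWhiteChars(char *s)
{
    size_t i = 0;
    size_t len = strlen(s);
    char* newS = calloc(len + 1, 1);   // one extra byte for the terminating '\0'
    size_t newSIndex = 0;
    size_t startPos = 0;

    if (newS == NULL)
    {
        return;
    }

    while (i<len)
    {
        // find the comma
        if (s[i] == ',')
        {
            // find the end of the token: back up over the spaces before the
            // comma, but never past startPos
            size_t end = i;
            while (end > startPos && isspace((unsigned char)s[end-1]))
            {
                end--;
            }

            // copy the token [startPos, end) into our new string
            size_t amountToCopy = end - startPos;
            memcpy(newS+newSIndex, s+startPos, amountToCopy);
            newSIndex += amountToCopy;
            newS[newSIndex++] = ',';

            // update startPos
            startPos = i+1;
            while (isspace((unsigned char)s[startPos]))
            {
                startPos++;
            }

            // update i; do not skip s[startPos], it may itself be a comma
            i = startPos;
        }
        else
        {
            i++;
        }
    }

    // finally tack on the end, then drop any trailing whitespace
    strcat(newS, s+startPos);
    newSIndex = strlen(newS);
    while (newSIndex > 0 && isspace((unsigned char)newS[newSIndex-1]))
    {
        newS[--newSIndex] = '\0';
    }

    // You can return newS if you're allowed to change your function
    // signature, or strcpy it to s
    printf("%s\n", newS);
    free(newS);
}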

I have only tested it with your input string, so it may break for other cases.


CodePudding user response:

Please try this:

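void removeWhiteChars(char *s)
{
    int i = 0;
    int count = 0;
    int isSomething = 0; /* 0: nothing pending, 1: whitespace pending, 2: comma pending */
    while (s[i])
    {
        if (s[i] == ',' && isSomething == 0)
            isSomething = 2;
        else if (s[i] == ',' && isSomething == 1)
            isSomething = 2; /* whitespace before a comma is dropped */
        else if (s[i] == ',' && isSomething == 2)
        {
            /* consecutive commas: emit the pending one, keep the new one pending */
            s[count++] = ',';
        }
        else if (isspace((unsigned char)s[i]) && isSomething == 0)
            isSomething = 1;
        else if (isspace((unsigned char)s[i]) && isSomething == 1)
            isSomething = 1;
        else if (isspace((unsigned char)s[i]) && isSomething == 2)
            isSomething = 2; /* whitespace after a comma is dropped */
        else if (isSomething == 1)
        {
            /* a run of whitespace inside a field is written back as one space */
            s[count++] = ' ';
            s[count++] = s[i];
            isSomething = 0;
        }
        else if (isSomething == 2)
        {
            s[count++] = ',';
            s[count++] = s[i];
            isSomething = 0;
        }
        else
            s[count++] = s[i];

        ++i;
    }
    if (isSomething == 2)
        s[count++] = ','; /* flush a comma that ends the string */
    s[count] = '\0'; /* adding NULL-terminate to the string */
}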

CodePudding user response:

Here is one possible algorithm. It is not necessarily well-optimized as presented here, but exists to demonstrate one possible implementation of an algorithm. It is intentionally partially abstract.

The following is a general state-machine approach that you can use to trim whitespace, and that carries over to other filtering tasks if you generalize it.

This implementation has not been verified to work as-is, however.

You should track the previous character, together with a value representing the current path of execution. When you see the pair { ',', ' ' } (a comma followed by a space), you begin a chain. When you see any other character, the chain should break. We'll be defining a function:

// const char *const in: indicates intent to read from in only
void trim_whitespace(const char *const in, char *out, uint64_t const out_length);

We are defining a deterministic algorithm in which all execution paths are known. For each unique possible state of execution, assign a numeric value, increasing linearly from zero, using an enum defined within the function for readability, and dispatch on it with a switch statement (unless goto and labels model the behavior of the algorithm better):

void trim_whitespace(const char *const in, char *out, uint64_t const out_length) {
    // better to use ifdefs first or avoid altogether with auto const variable,
    // but you get the point here without all that boilerplate
    #define CHAR_NULL 0

    enum {
        DEFAULT = 0,
        WHITESPACE_CHAIN
    } execution_state = DEFAULT;
    
    // track if loop is executing; makes the logic more readable;
    // can also detect environment instability
    // volatile: don't want this to be optimized out of existence
    volatile bool executing = true;

    while(executing) {
        switch(execution_state) {
        case DEFAULT:
            ...
        case WHITESPACE_CHAIN:
            ...
        default:
            ...
        }
    }

    // don't forget to undefine once finished so another function can use
    // the same macro name!
    #undef CHAR_NULL
}

The number of values the state variable can take is equal to 2**ceil(log_2(n)), where n is the number of actual execution states relevant to the operation of the current algorithm. You should explicitly name the actual states and make cases for them in the switch statement, and leave the remaining values to the default case.

In the DEFAULT case, we're only checking for commas. If the previous character was a comma, and the current character is a space, then we want to set the state to WHITESPACE_CHAIN.

In the WHITESPACE_CHAIN case, we test only if the previous character was whitespace, and the current character is non-whitespace.

The loop should look something like this:

...
// black boxing subjectives for portability, maintenance, and readability
bool is_whitespace(char);
bool is_comma(char);
...

volatile bool executing = true;

// previous character (only updated at loop start, line #LL)
char previous = CHAR_NULL;
// current character (only updated at loop start, line #LL)
char current = CHAR_NULL;
// writes to out if true at end of current iteration; doesn't write otherwise
// (starts true so that the text before the first comma is copied)
bool write = true;

// current character index (only updated at loop end, line #LL)
uint64_t i = 0, j = 0;

while(executing) {
    previous = current;
    current = in[i];

    switch(execution_state) {
        case DEFAULT:
            if (!current) {
                executing = false;
                break;
            }

            if (is_comma(previous) && is_whitespace(current)) {
                execution_state = WHITESPACE_CHAIN;
                write = false;
            }
            
            break;

        case WHITESPACE_CHAIN:
            if (!current) {
                executing = false;
                break;
            }
            
            if (is_whitespace(previous) && !is_whitespace(current)) {
                execution_state = DEFAULT;
                write = true;
            }
            
            break;

        default:
            // impossible condition: unstable environment or SEU
            executing = true;
            out = NULL;
            return;
    }

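    // only write while there is room left for the terminator
    if (write && j + 1 < out_length) {
        out[j] = current;
        ++j;
    }

    ++i;
}

if (executing) {
    // memory error: unstable environment or SEU
    out = NULL;
} else {
    // execution successful: terminate the output string
    if (out_length > 0) {
        out[j] = CHAR_NULL;
    }
    return;
}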

// end of function
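
For reference, here is a compact, self-contained sketch of the two-state idea described above. It is a variation rather than the exact loop shown: the function name trim_after_commas, the int return code, and the use of size_t instead of uint64_t are assumptions made for this sketch, it enters the chain state directly on the comma, and it uses plain if statements instead of a switch. Like the description above, it only drops whitespace that follows a comma.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Sketch: copies in to out, dropping whitespace that directly follows a comma.
 * Returns 0 on success, -1 if out (of out_length bytes) is too small. */
int trim_after_commas(const char *in, char *out, size_t out_length)
{
    enum { DEFAULT, WHITESPACE_CHAIN } state = DEFAULT;
    size_t i, j = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; in[i] != '\0'; ++i) {
        char c = in[i];

        if (state == WHITESPACE_CHAIN) {
            if (isspace((unsigned char)c))
                continue;              /* still inside the chain: skip */
            state = DEFAULT;           /* chain broken by a non-space */
        }

        if (j + 1 >= out_length)
            return -1;                 /* no room for c plus the terminator */
        out[j++] = c;

        if (c == ',')
            state = WHITESPACE_CHAIN;  /* start skipping after a comma */
    }

    if (j >= out_length)
        return -1;
    out[j] = '\0';
    return 0;
}
```

Whitespace before a comma is left alone in this sketch, so "a,  b ,c" becomes "a,b ,c".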

Please kindly also use the word whitespace to describe these characters as that is what they are commonly known as, not "white chars".

CodePudding user response:

A short and reliable way to approach any parsing problem is to use a state-loop which is nothing more than a loop over all the characters in your original string where you use one (or more) flag variables to keep track of the state of anything you need to track. In your case here, you need to know the state of whether you are reading post (after) the comma.

This controls how you handle the next character. You will use a simple counter variable to keep track of the number of spaces you have read, and when you encounter the next character, if you are not post-comma, you append that number of spaces to your new string. If you are post-comma, you discard the buffered spaces. (You can use encountering the ',' itself as the signal to discard the spaces read before it, so that case needs no variable of its own.)

To remove spaces around the ',' delimiter, you can write a rmdelimws() function that takes the new string to fill and the old string to copy from as arguments and do something similar to:

void rmdelimws (char *newstr, const char *old)
{
  size_t spcount = 0;               /* space count */
  int postcomma = 0;                /* post comma flag */
  
  while (*old) {                    /* loop each char in old */
    if (isspace ((unsigned char)*old)) {  /* if space? */
      spcount += 1;                 /* increment space count */
    }
    else if (*old == ',') {         /* if comma? */
      *newstr++ = ',';              /* write to new string */
      spcount = 0;                  /* reset space count */
      postcomma = 1;                /* set post comma flag true */
    }
    else {                          /* normal char? */
      if (!postcomma) {             /* if not 1st char after comma */
        while (spcount--) {         /* append spcount spaces to newstr */
          *newstr++ = ' ';
        }
      }
      spcount = postcomma = 0;      /* reset spcount and postcomma */
      *newstr++ = *old;             /* copy char from old to newstr */
    }
    old++;                          /* increment pointer */
  }
  *newstr = 0;                      /* nul-terminate newstr */
}

Putting it together in a short example, you would have:

#include <stdio.h>
#include <ctype.h>

void rmdelimws (char *newstr, const char *old)
{
  size_t spcount = 0;               /* space count */
  int postcomma = 0;                /* post comma flag */
  
  while (*old) {                    /* loop each char in old */
    if (isspace ((unsigned char)*old)) {  /* if space? */
      spcount += 1;                 /* increment space count */
    }
    else if (*old == ',') {         /* if comma? */
      *newstr++ = ',';              /* write to new string */
      spcount = 0;                  /* reset space count */
      postcomma = 1;                /* set post comma flag true */
    }
    else {                          /* normal char? */
      if (!postcomma) {             /* if not 1st char after comma */
        while (spcount--) {         /* append spcount spaces to newstr */
          *newstr++ = ' ';
        }
      }
      spcount = postcomma = 0;      /* reset spcount and postcomma */
      *newstr++ = *old;             /* copy char from old to newstr */
    }
    old++;                          /* increment pointer */
  }
  *newstr = 0;                      /* nul-terminate newstr */
}


int main (void) {
  
  char str[] = "NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   ",
       newstr[sizeof str] = "";
  
  rmdelimws (newstr, str);
  
  puts (str);
  puts (newstr);
  
  return 0;
}

Example Use/Output

$ ./bin/rmdelimws
NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1
NA ME,NAME,123 456,124,14134,134. 134,1
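
Because the result is never longer than the input, the same state loop can also write back into the original buffer, which keeps the signature from the question. The following is a sketch under a few assumptions that are not part of the code above: it starts in the post-comma state so that leading whitespace is dropped as well, it copies the original whitespace characters back with memmove instead of writing ' ', and the names r and w are illustrative only.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Sketch: in-place variant of the state loop, keeping the question's
 * signature. r reads ahead, w writes behind it (w never passes r). */
void removeWhiteChars(char *s)
{
    char *r = s;           /* read position */
    char *w = s;           /* write position */
    size_t spcount = 0;    /* whitespace seen since the last non-space char */
    int postcomma = 1;     /* 1 at start, so leading whitespace is dropped too */

    assert(s != NULL);

    for (; *r != '\0'; r++) {
        if (isspace((unsigned char)*r)) {
            spcount++;
        } else if (*r == ',') {
            *w++ = ',';
            spcount = 0;            /* discard whitespace before the comma */
            postcomma = 1;
        } else {
            if (!postcomma) {
                /* whitespace inside a field: copy it back unchanged */
                memmove(w, r - spcount, spcount);
                w += spcount;
            }
            spcount = 0;
            postcomma = 0;
            *w++ = *r;
        }
    }
    *w = '\0';              /* trailing whitespace was never written */
}
```

Applied to the string from the question, this leaves "NA ME,NAME,123 456,124,14134,134. 134,1" in the buffer, and consecutive commas are kept, so " a ,, b " becomes "a,,b".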