Manipulating a string and rewriting it by the function output-CodePudding

For some functions for string manipulation, I try to rewrite the function output onto the original string. I came up with the general scheme of

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *char_repeater(char *str, char ch)
{

  int tmp_len = strlen(str)   1; // initial size of tmp
  char *tmp = (char *)malloc(tmp_len); // initial size of tmp
  // the process is normally too complicated to calculate the final length here

  int j = 0;
  for (int i = 0; i < strlen(str); i  )
  {
    tmp[j] = str[i];
    j  ;
    if (str[i] == ch)
    {
      tmp[j] = str[i];
      j  ;
    }
    if (j > tmp_len)
    {
      tmp_len *= 2; // growth factor
      tmp = realloc(tmp, tmp_len);
    }
  }
  tmp[j] = 0;
  char *output = (char *)malloc(strlen(tmp)   1); 
   // output matching the final string length

  strncpy(output, tmp, strlen(tmp));
  output[strlen(tmp)] = 0;

  free(tmp); // Is it necessary?
  return output;
}

int main()
{

  char *str = "This is a test";
  str = char_repeater(str, 'i');

  puts(str);
  free(str);
  return 0;
}

Although it works on simple tests, I am not sure if I am on the right track.

Is this approach safe overall?
Of course, we do not re-write the string. We simply write new data (array of the characters) at the same pointer. If output is longer than str, it will rewrite the data previously written at str, but if output is shorter, the old data remains, and we would have a memory leak. How can we free(str) within the function before outputting to its pointer?

CodePudding user response：

A pair of pointers can be used to iterate through the string.
When a matching character is found, increment the length.
Allocate output as needed.
Iterate through the string again and assign the characters.
This could be done in place if str was malloced in main.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *char_repeater(char *str, char ch)
{

    int tmp_len = strlen(str)   1; // initial size of tmp
    char *find = str;

    while ( *find) // not at terminating zero
    {
        if ( *find == ch) // match
        {
            tmp_len  ; // add one
        }
          find; // advance pointer
    }
    char *output = NULL;
    if ( NULL == ( output = malloc(tmp_len)))
    {
        fprintf ( stderr, "malloc peoblem\n");
        exit ( 1);
    }
    // output matching the final string length
    char *store = output; // to advance through output
    find = str; // reset pointer
    while ( *find) // not at terminating zero
    {
        *store = *find; // assign
        if ( *find == ch) // match
        {
              store; // advance pointer
            *store = ch; // assign
        }
          store; // advance pointer
          find;
    }
    *store = 0; // terminate

    return output;
}

int main()
{

    char *str = "This is a test";
    str = char_repeater(str, 'i');

    puts(str);
    free(str);
    return 0;
}

CodePudding user response：

For starters the function should be declared like

char * char_repeater( const char *s, char c );

because the function does not change the passed string.

Your function is unsafe and inefficient at least because there are many dynamic memory allocations. You need to check that each dynamic memory allocation was successful. Also there are called the function strlen also too ofhen.

Also this code snippet

tmp[j] = str[i];
j  ;
if (str[i] == ch)
{
  tmp[j] = str[i];
  j  ;
}
if (j > tmp_len)
//...

can invoke undefined behavior. Imagine that the source string contains only one letter 'i'. In this case the variable tmp_len is equal to 2. So temp[0] will be equal to 'i' and temp[1] also will be equal to 'i'. In this case j equal to 2 will not be greater than tmp_len. As a result this statement

tmp[j] = 0;

will write outside the allocated memory.

And it is a bad idea to reassign the pointer str

char *str = "This is a test";
str = char_repeater(str, 'i');

As for your question whether you need to free the dynamically allocated array tmp

free(tmp); // Is it necessary?

then of course you need to free it because you allocated a new array for the result string

char *output = (char *)malloc(strlen(tmp)   1);

And as for your another question

but if output is shorter, the old data remains, and we would have a memory leak. How can we free(str) within the function before outputting to its pointer?

then it does not make a sense. The function creates a new character array dynamically that you need to free and the address of the allocated array is assigned to the pointer str in main that as I already mentioned is not a good idea.

You need at first count the length of the result array that will contain duplicated characters and after that allocate memory only one time.

Here is a demonstration program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char * char_repeater( const char *s, char c )
{
    size_t n = 0;

    for ( const char *p = s; ( p = strchr( p, c ) ) != NULL;   p )
    {
          n;
    }

    char *result = malloc( strlen( s )   1   n );

    if ( result != NULL )
    {
        if ( n == 0 )
        {
            strcpy( result, s );
        }
        else
        {
            char *p = result;

            do
            {
                *p   = *s;
                if (*s == c ) *p   = c;
            } while ( *s   );
        }
    }

    return result;
}

int main( void ) 
{
    const char *s = "This is a test";

    puts( s );

    char *result = char_repeater( s, 'i' );

    if ( result != NULL ) puts( result );

    free( result );
}

The program output is

This is a test
Thiis iis a test

CodePudding user response：

My kneejerk reaction is to dislike the design. But I have reasons.

First, realloc() is actually quite efficient. If you are just allocating a few extra bytes every loop, then chances are that the standard library implementation simply increases the internal bytecount value associated with your memory. Caveats are:

Interleaving memory management.
Your function here doesn’t have any, but should you start calling other routines then keeping track of all that becomes an issue. Anything that calls other memory management routines can lead to the next problem:
Fragmented memory.
If at any time the available block is too small for your new request, then a much more expensive operation to obtain more memory and copy everything over becomes an issue.

Algorithmic issues are:

Mixing memory management in increases the complexity of your code.
Every occurrence of c invokes a function call with potential to be expensive. You cannot control when it is expensive and when it is not.
Worst-case options (char_repeater( "aaaaaaaaaa", 'a' )) trigger worst-case potentialities.

My recommendation is to simply make two passes.

This passes several smell tests:

Algorithmic complexity is broken down into two simpler parts:
- counting space required, and
- allocating and copying.
Worst-case scenarios for allocation/reallocation are reduced to a single call to malloc().
Issues with very large strings are reduced:
- You need at most space for 2 large strings (not 3, possibly repeated)
- Page fault / cache boundary issues are similar (or the same) for both methods

Considering there are no real downsides to using a two-pass approach, I think that using a simpler algorithm is reasonable. Here’s code:

#include <stdio.h>
#include <stdlib.h>

char * char_repeater( const char * s, char c )
{
  // FIRST PASS

  // (1) count occurances of c in s
  size_t number_of_c = 0;
  const char * p = s;
  while (*p) number_of_c  = (*p   == c);

  // (2) get strlen s
  size_t length_of_s = p - s;

  // SECOND PASS

  // (3) allocate space for the resulting string
  char * dest = malloc( length_of_s   number_of_c   1 );

  // (4) copy s -> dest, duplicating every occurance of c
  if (dest)
  {
    char * d = dest;
    while (*s)
      if ((*d   = *s  ) == c)
        *d   = c;
    *d = '\0';
  }
  return dest;
}

int main(void)
{
  char * s = char_repeater( "Hello world!", 'o' );
  puts( s );
  free( s );
  return 0;
}

As always, know your data

Whether or not a two-pass approach actually is better than a realloc() approach depends on more factors than what is evident in a posting on the internet.

Nevertheless, I would wager that for general purpose strings that this is a better choice.

But, even if it isn’t, I would argue that a simpler algorithm, splitting tasks into trivial sub-tasks, is far easier to read and maintain. You should only start making tricky algorithms only if you have use-case profiling saying you need to spend more attention on it.

Without that, readability and maintainability trumps all other concerns.