Edge case in C program to find and replace words from a text file-CodePudding

I'm new to C and would greatly appreciate some help in fixing a bug in my program.

I've identified an edge case and I'm not quite sure how to resolve it.

Currently, the function will find and replace words and words within words given a text file. For instance, changing 'water' to 'snow' will change the string 'waterfall' into 'snowfall'. This is an intended result.

However, when I input 'waterfalls' to change the word 'waterfall', the program seems to get stuck in an endless loop. I'm not quite sure why but I would appreciate if anyone could point me in the right direction.

Here is my code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>

#define BUFFER_SIZE 20

void replaceWord(char *str, const char *oldWord, const char *newWord)
{
    char *position, buffer[BUFFER_SIZE];
    int index, oldWordLength;
    oldWordLength = (long)strlen(oldWord);
    while ((position = strstr(str, oldWord)) != NULL)
    {
        strcpy(buffer, str);
        index = position - str;
        str[index] = '\0';
        strcat(str, newWord);
        strcat(str, buffer   index   oldWordLength);
    }
}

int main()
{
    char msg[100] = "This is some text with the word snowfall to replace";
    puts(msg);
    replaceWord(msg, "snowfall", "snowfalls");
    puts(msg);
    return 0;
}

CodePudding user response：

Ok. First, you have a severely undersized buffer. This:

char buffer[BUFFER_SIZE];

is the target of what ends up being a full-string copy of the original message. But in main, the original message:

char msg[100] = "This is some text with the word snowfall to replace";

is 51 characters wide (not including the terminator). That's not going to work, and running through a debug address sanitizer (or a regular debugger, ideally) would show that. :

==1==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd578054d4 at pc 0x7f7457e4c846 bp 0x7ffd57805460 sp 0x7ffd57804c10
WRITE of size 52 at 0x7ffd578054d4 thread T0
    #0 0x7f7457e4c845  (/opt/compiler-explorer/gcc-11.2.0/lib64/libasan.so.6 0x55845)
    #1 0x4012a7 in replaceWord /app/example.c:15
    #2 0x401592 in main /app/example.c:27
    #3 0x7f7457c2c0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6 0x270b2)
    #4 0x40112d in _start (/app/output.s 0x40112d)

Address 0x7ffd578054d4 is located in stack of thread T0 at offset 52 in frame
    #0 0x4011f5 in replaceWord /app/example.c:9

So that clearly isn't any good. Fixing that by increasing the buffer size will "work", but to really do this right the source/target buffer (in your function they're one in the same) should have its full writable width (and that includes space for a terminator) provided as an argument (which you should definitely do.

Second, the code that does this:

while ((position = strstr(str, oldWord)) != NULL)

is always starting the search for oldWord from the beginning of the input string. that's wrong (well, it's right exactly once; the first time through; after that, it's wrong). Consider this:

i love c

Suppose I look or i and I want to replace it with is. I would find it here:

i love c  
^

After replacement my new string I'm building looks like this:

is love c

So where do you know to start the next search from? You start at the position where the original string started, plus the length of the replacement string value. The original was in pos 0. the length of the replacement was 2, so we start the next search at pos 2.

is love c  
  ^

Note that this gets a LOT more complicated when you're doing everything in-place (e.g. no intermediate buffer), but that doesn't seem to be a goal for you right now, and is likely off your radar. Therefore, a not-very-efficient, but functional way to do this is:

Start at the beginning of the string (src = str)
Search for the old word, starting at src.
If found, copy the original string up-to, but not including, the old word to a buffer.
Append the replacement string to the buffer.
Append the remainder of the original string passed the old word to the buffer.
Copy the buffer back to the source string.
Reposition src to be the length of the replacement word past the original position of the found old word from (3)
Loop back to (2) until the old word is no longer found.

As I said; not very efficient, but it is fairly easy to understand. It looks like this in code. Note the buffer size was significantly increased, and is used to declare both the temp buffer and the array in main. This still isn't good, but it was what you brought, so I'm sticking with it. I urge you to consider doing this algorithm with dynamic memory management, or size-restrictions passed as additional argument(s):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>

#define BUFFER_SIZE 100

void replaceWord(char *str, const char *oldWord, const char *newWord)
{
    char buffer[ BUFFER_SIZE ];
    char *src = str;
    char *oldpos = NULL;

    size_t lenOldWord = strlen(oldWord);
    size_t lenNewWord = strlen(newWord);

    while ((oldpos = strstr(src, oldWord)) != NULL)
    {
        // 1. copy everything up to the old word.
        // 2. append the new word
        // 3. copy the remainder of source string *past* the old word
        // 4. copy back to the original string.
        memcpy(buffer, str, (size_t)(oldpos - str));
        memcpy(buffer   (oldpos - str), newWord, lenNewWord);
        strcpy(buffer   (oldpos - str)   lenNewWord, oldpos   lenOldWord);
        strcpy(str, buffer);

        // the new starting point will be the previous discovry
        //  location plus the length of the new word.
        src = oldpos   lenNewWord;
    }
}

int main()
{
    char msg[BUFFER_SIZE] = "This is some text with the word snowfall to replace";
    puts(msg);
    replaceWord(msg, "snowfall", "snowfalls");
    puts(msg);
    return 0;
}

Output

This is some text with the word snowfall to replace
This is some text with the word snowfalls to replace

I strongly suggest you to run this in a debugger and watch how it works, step by step. It will help you understand where you were missing the start-search-here logic. And I urge you even more, as an exercise, address the blatant vulnerability of this algorithm. think of ways you could easily invoke undefined behavior (hint: short, common old word, very very long replacement word), and the things you should to do address those problems.