How to read char/string one by one from a file and compare in C

Time:03-08

This is my first time asking a question here. I'm currently learning C and Linux at the same time. I'm working on a simple C program that uses only system calls to read and write files. My problem now is: how can I read the file and check whether consecutive strings/words are the same or not? Here is an example:

foo.txt contains:

hi
bye
bye
hi
hi

And bar.txt is empty.

After I do:

./myuniq foo.txt bar.txt

The result in bar.txt will be like:

hi
bye
hi

The result should be the same as when we use uniq in Linux.

Here is my code:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define LINE_MAX 256

int main(int argc, char * argv[]){
    int wfd,rfd;
    size_t n;
    char temp[LINE_MAX];
    char buf[LINE_MAX];
    char buf2[LINE_MAX];
    char *ptr=buf;

    if(argc!=3){
        printf("Invalid usage: ./executableFileName readFromThisFile writeToThisFile\n");
        return -1;
    }

    rfd=open(argv[1], O_RDONLY);
    if(rfd==-1){
        printf("Unable to read the file\n");
        return -1;
    }

    wfd=open(argv[2], O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR);
    if(wfd==-1){
        printf("Unable to write to the file\n");
        return -1;
    }

    while(n = read(rfd,buf,LINE_MAX)){
        write(wfd,buf,n);
    }

    close(rfd);
    close(wfd);
    return 0;
}

The code above does the reading and writing with no issue. But I can't figure out how to read characters one by one into a C-style string, or what condition the while loop should use.

I do know that I may need a pointer to walk through buf to find the next newline '\n', and something like:

while(condi){
    if(*ptr == '\n'){
        strcpy(temp, buf);
        strcpy(buf, buf2);
        strcpy(buf2, temp);
    }
    else
        write(wfd,buf,n);

    *ptr++;
}

But I might be wrong since I can't get it to work. Any feedback might help. Thank you.

And again, only system calls can be used to accomplish this. I do know there is an easier way, using FILE and fgets or something else, to get this done. But that's not an option here.

CodePudding user response:

You only need one buffer that stores whatever the previous line contained.

The way this works for the current line is that before you add a character you test whether what you're adding is the same as what's already in there. If it's different, then the current line is marked as unique. When you reach the end of the line, you then know whether to output the buffer or not.

Implementing the above idea using standard input for simplicity (but it doesn't really matter how you read your characters):

int len = 0;
int dup = 0;
for (int c; (c = fgetc(stdin)) != EOF; )
{
    // Check for duplicate and store
    if (dup && buf[len] != c)
        dup = 0;
    buf[len++] = c;

    // Handle end of line
    if (c == '\n')
    {
        if (!dup) fwrite(buf, 1, len, stdout);  // buf is not NUL-terminated here
        len = 0;
        dup = 1;
    }
}

See here that we use the dup flag to represent whether a line is duplicated. For the first line, clearly it is not, and all subsequent lines start off with the assumption they are duplicates. Then the only possibility is to remain a duplicate or be detected as unique when one character is different.

The comparison before store is actually avoiding tests against uninitialized buffer values too, by way of short-circuit evaluation. That's all managed by the dup flag -- you only test if you know the buffer is already good up to this point:

if (dup && buf[len] != c)
    dup = 0;

That's basically all you need. Now, you should definitely add some sanity to prevent buffer overflow. Or you may wish to use a dynamic buffer that grows as necessary.
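One way to add that sanity check is sketched below. The fixed size BUF_SIZE, the function name uniq_bounded, and the policy for overlong lines (pass them through and never treat them as duplicates) are all assumptions made for this example, not something the approach above prescribes.

```c
#include <stdio.h>

#define BUF_SIZE 256   /* assumed limit, chosen for this sketch */

/* Copies in to out, dropping lines identical to the one just before. */
void uniq_bounded(FILE *in, FILE *out)
{
    char buf[BUF_SIZE];
    size_t len = 0;
    int dup = 0;        /* line so far matches the previous line */
    int overflow = 0;   /* line outgrew buf and is being passed through */

    for (int c; (c = fgetc(in)) != EOF; ) {
        if (!overflow && len == BUF_SIZE) {
            /* No room left: flush what is stored and stop buffering. */
            fwrite(buf, 1, len, out);
            len = 0;
            overflow = 1;
        }

        if (overflow) {
            fputc(c, out);
        } else {
            if (dup && buf[len] != (char)c)
                dup = 0;
            buf[len++] = (char)c;
        }

        if (c == '\n') {
            if (!overflow && !dup)
                fwrite(buf, 1, len, out);
            len = 0;
            dup = !overflow;   /* an overlong line cannot be compared against */
            overflow = 0;
        }
    }

    /* Last line without a trailing newline: it is a duplicate only if the
     * previous line ends exactly where this one stopped. */
    if (len > 0 && !(dup && buf[len] == '\n'))
        fwrite(buf, 1, len, out);
}
```

Calling uniq_bounded(stdin, stdout) gives the same behaviour as the loop above for lines that fit in the buffer. The trade-off is that two identical lines longer than BUF_SIZE are both printed.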

An entire program that operates on standard I/O streams, plus handles arbitrary-length lines might look like this:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    size_t capacity = 15, len = 0;
    char *buf = malloc(capacity);
    
    for (int c, dup = 0; (c = fgetc(stdin)) != EOF || len > 0; )
    {
        // Grow buffer
        if (len == capacity)
        {
            capacity = (capacity * 2) + 1;
            char *newbuf = realloc(buf, capacity);
            if (!newbuf) break;
            buf = newbuf;
            dup = 0;
        }

        // NUL-terminate end of line, update duplicate-flag and store
        if (c == '\n' || c == EOF)
            c = '\0';
        if (dup && buf[len] != c)
            dup = 0;
        buf[len++] = c;

        // Output line if not a duplicate, and reset
        if (!c)
        {
            if (!dup)
                printf("%s\n", buf);
            len = 0;
            dup = 1;
        }
    }

    free(buf);
}

Demo here: https://godbolt.org/z/GzGz3nxMK

CodePudding user response:

If you must use the read and write system calls, you will have to build an abstraction around them, as they have no notion of lines, words, or characters. Semantically, they deal purely with bytes.
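As a small illustration of such an abstraction, here is a sketch of a helper that hides one byte-level detail of write: it may accept fewer bytes than requested. The name write_all is hypothetical, and retrying on EINTR is a design choice for this example.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write until all len bytes are sent.
 * Returns 0 on success, -1 on error (errno is left as write set it). */
int write_all(int fd, const char *data, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything: retry */
            return -1;
        }
        data += n;          /* skip the part that was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

A call such as write_all(wfd, buf, n) could stand in for a bare write(wfd, buf, n) wherever a short write would otherwise go unnoticed.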

Reading arbitrarily-sized chunks of the file would require us to sift through looking for line breaks. This would mean tokenizing the data in our buffer, as you have somewhat shown. A problem occurs when our buffer ends with a partial line. We would need to make adjustments so our next read call concatenates the rest of the line.
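For comparison, the chunked approach described above might look roughly like this. It is only a sketch: the name uniq_chunked and the size CHUNK are made up for the example, a line that does not fit in the buffer is simply reported as an error, and a short write is treated as a failure rather than retried.

```c
#include <string.h>
#include <unistd.h>

#define CHUNK 4096   /* assumed buffer size; every line must be shorter */

/* Copies rfd to wfd, dropping lines equal to the line just before them.
 * Returns 0 on success, -1 on a read/write error or an overlong line. */
int uniq_chunked(int rfd, int wfd)
{
    char buf[CHUNK];       /* current chunk; may start with a partial line */
    char prev[CHUNK];      /* last line written, newline included */
    size_t prev_len = 0;   /* 0 means there is no previous line yet */
    size_t used = 0;       /* bytes currently held in buf */
    ssize_t n;

    while ((n = read(rfd, buf + used, sizeof buf - used)) > 0) {
        used += (size_t)n;
        char *line = buf;
        char *nl;

        /* Handle every complete line currently in the buffer. */
        while ((nl = memchr(line, '\n', used - (size_t)(line - buf))) != NULL) {
            size_t len = (size_t)(nl - line) + 1;
            if (len != prev_len || memcmp(line, prev, len) != 0) {
                if (write(wfd, line, len) != (ssize_t)len)
                    return -1;
                memcpy(prev, line, len);
                prev_len = len;
            }
            line = nl + 1;
        }

        /* Move the trailing partial line to the front, so the next read
         * appends the rest of it. */
        used -= (size_t)(line - buf);
        memmove(buf, line, used);
        if (used == sizeof buf)
            return -1;     /* a single line filled the whole buffer */
    }
    if (n < 0)
        return -1;

    /* Final line without a trailing newline. */
    if (used > 0 && !(used + 1 == prev_len && memcmp(buf, prev, used) == 0))
        if (write(wfd, buf, used) != (ssize_t)used)
            return -1;
    return 0;
}
```

The memmove step is the "adjustment" mentioned above: whatever follows the last newline in the buffer is kept and completed by the next read call.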

To keep things simple, instead, we might consider reading the file one byte at a time.

A decent (if naive) way to begin is by essentially reimplementing the rough functionality of fgets. Here we read a single byte at a time into our buffer, at the current offset. We stop when we find a newline character, or when we would no longer have enough room in the buffer for the null-terminating character.

Unlike fgets, here we return the length of our string.

size_t read_a_line(char *buf, size_t bufsize, int fd)
{
    size_t offset = 0;

    while (offset < (bufsize - 1) && read(fd, buf + offset, 1) > 0)
        if (buf[offset++] == '\n')
            break;

    buf[offset] = '\0';

    return offset;
}

To mimic uniq, we can create two buffers, as you have, but initialize their contents to empty strings. We take two additional pointers to manipulate later.

char buf[LINE_MAX] = { 0 };
char buf2[LINE_MAX] = { 0 };
char *flip = buf;
char *flop = buf2;

After opening our files for reading and writing, our loop begins. We continue this loop as long as we read a nonzero-length string.

If our current string does not match our previously read string, we write it to our output file. Afterwards, we swap our pointers. On the next iteration, from the perspective of our pointers, the secondary buffer now contains the previous line, and the primary buffer is overwritten with the current line.

Again, note that our initial previously read line is the empty string.

size_t length;

while ((length = read_a_line(flip, LINE_MAX, rfd))) {
    if (0 != strcmp(flip, flop))
        write(wfd, flip, length);

    swap_two_pointers(&flip, &flop);
}

Our pointer swapping function.

void swap_two_pointers(char **a, char **b) {
    char *t = *a;
    *a = *b;
    *b = t;
}

Some notes:

  • The file being read should never contain a line longer than LINE_MAX - 1 bytes (including the newline character), since one byte of the buffer is reserved for the null terminator. We do not handle this situation, which is admittedly a sidestep, as this is the problem we wanted to avoid with the chunking method.
  • read_a_line should not be passed NULL as its first argument or 0 as its second. Figuring out why is left as an exercise for the reader.
  • read_a_line does not really handle read failing in the middle of a line.
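As a possible follow-up to the last two notes, here is a sketch of a variant that reports failures instead of silently ending the line. The name read_a_line_checked, the -1 return convention, and the retry on EINTR are assumptions for this example. Note that the return type changes to ssize_t so that an error can be told apart from end of file.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical variant of read_a_line: returns the line length,
 * 0 at end of file, or -1 if read fails or the arguments are unusable. */
ssize_t read_a_line_checked(char *buf, size_t bufsize, int fd)
{
    size_t offset = 0;

    if (buf == NULL || bufsize == 0)
        return -1;

    while (offset < bufsize - 1) {
        ssize_t n = read(fd, buf + offset, 1);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            buf[offset] = '\0';
            return -1;             /* real error; caller can inspect errno */
        }
        if (n == 0)
            break;                 /* end of file */
        if (buf[offset++] == '\n')
            break;
    }

    buf[offset] = '\0';
    return (ssize_t)offset;
}
```

With this variant the loop condition would become something like (length = read_a_line_checked(flip, LINE_MAX, rfd)) > 0, with length declared as ssize_t, and a negative result can be handled after the loop.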