Reading from text file in C with fgets: what's that character, if it's not \0 and \n?

Time:10-31

EDIT: I'll leave the question here for other people to read if they have the same problem; I've been advised in the comments that the solution is line[strcspn(line,"\r\n")]= 0. In my Operating Systems course I was never told about \r, so you may have had the same issue and this might be useful to you, too.

So I've already read everything on Stack Overflow regarding how to get rid of the \n character after reading from a text file using fgets.

In my C file, I have this written:

const char *ESCAPE= "1a2b3c4e5d";
FILE *FP= fopen("backup.txt", "r");

Assume this is what I've got written in backup.txt: 1a2b3c4e5d\n Mark

As you can see, the first line would be identical to ESCAPE were it not for the \n character. Now let's look at the code below, in which I try to identify "1a2b3c4e5d" in the file and, after removing the \n character, do a strcmp:

char line[64];
while(fgets(line, sizeof(line), FP)){
   fprintf(stdout, "this is line length: %zu", strlen(line));
   // It prints 12
   line[strlen(line) -1]= 0; // Removing the new_line character
   fprintf(stdout, "This is line after getting rid of new_line: %zu\n", strlen(line));
   // It prints 11.
   fprintf(stdout, "This is ESCAPE length: %zu\n", strlen(ESCAPE));
   // It prints 10;

   if(strcmp(line, ESCAPE) == 0){
      fprintf(stdout, "I'm Here\n");
   }
}

The very first read of fgets will store "1a2b3c4e5d\n" in line, which has length 12 according to strlen. Now, I read 10 characters plus the newline, which makes 11, since strlen does not count the null terminator. So I expected the first print to be 11 and the second print, after I removed \n, to be 10; instead they are 12 and 11.

This means there's something else inside the buffer, but I really don't understand what it is, and, of course, the strcmp will never be true because of this mysterious 11th character. Do you have any idea what it is? And how can I solve it? Thanks!

I tried to look at every answer on Stack Overflow. Some even suggest using strcspn, which was a nice discovery (it even solves some troubling situations with the buffer), but the code does not work in this situation, for some reason. I can't find an answer, and therefore I asked this question.

CodePudding user response:

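Long story. The short version: in Unix, lines end with \n; in Windows they end with \r\n. So a file saved with Windows line endings and read on a Unix system leaves a \r in the buffer just before the \n, and that is the extra character you are seeing.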
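As a minimal sketch of the fix mentioned in the question's EDIT, the line can be cut at the first \r or \n, whichever comes first, so the same code copes with both line-ending styles. The helper name strip_eol is made up for this illustration; strcspn is the standard function from <string.h>.

```c
#include <string.h>

/* Hypothetical helper (name chosen for this example): truncate `line`
   at the first '\r' or '\n'. If neither is present, strcspn returns the
   length of the string, so the assignment just rewrites the existing
   terminator and the string is unchanged. */
char *strip_eol(char *line)
{
    line[strcspn(line, "\r\n")] = '\0';
    return line;
}
```

Called on the buffer right after fgets, this should leave "1a2b3c4e5d" with length 10 whether the file used \n or \r\n, so the strcmp against ESCAPE can succeed. Unlike line[strlen(line) - 1] = 0, it also does nothing harmful when the last line of the file has no newline at all.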

The longer version is more complex. Now, as a "unix lover, windows hater" old geek, I should tell you that Windows is wrong and Unix is right. But actually, \r\n also makes sense, from a historical point of view.

All of that goes back to the days when the output of a computer was a serial line connected to a printer. Not a fancy laser printer, but more of an electronically commanded typewriter. This printer received a stream of bytes (7-bit bytes, the 8th bit being used for parity), following a protocol. 65 meant "print an A". 48 meant "print a 0". That is the well-known ASCII code. And some of those 128 (7 bits, again) numbers meant something other than "print this". For example, 7 meant "ring the bell" (like a microwave, so that someone comes to see the result of the computation when it is ready :D). 8 meant "go one character backward" (for example, to print something else over the previously printed character). Etc. And 10 meant "go one line down" (I say "meant", but all of that still means the same today; it just makes more sense to think of the very down-to-earth meaning it had then). And 13 meant "go back to the beginning of the line".

So, to print "hello" on one line, then "world" on another line, you had to send bytes 104, 101, 108, 108, 111, 13, 10, 119, 111, 114, 108, 100. Meaning "print h, e, l, l and o. Then send the head back to the beginning of the line (13), and feed the paper 1 line forward (10). And print w, o, r, l, d."
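To see those numbers for yourself, here is a small sketch that writes the decimal code of every byte of a string into a buffer. The helper format_bytes is hypothetical, written only for this illustration; the same trick is a handy way to find out what a "mysterious" character in a line read by fgets really is.

```c
#include <stdio.h>

/* Hypothetical helper: write the decimal code of each byte of `s` into
   `out`, separated by single spaces, and return `out`. If `out` is too
   small, the list is cut off after the last number that fits entirely. */
char *format_bytes(const char *s, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return out;
    out[0] = '\0';

    for (size_t i = 0; s[i] != '\0'; i++) {
        int n = snprintf(out + used, out_size - used, "%s%d",
                         i == 0 ? "" : " ", (unsigned char)s[i]);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0';   /* drop the partially written number */
            break;
        }
        used += (size_t)n;
    }
    return out;
}
```

For "hello\r\nworld" this should produce the twelve numbers listed above, with 13 and 10 in the middle. Applied to the first line of the question's file, a trailing "13 10" instead of just "10" would give the hidden \r away.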

\n is just the char representation of 10, and \r of 13, in C (and by now almost everywhere else). In C, '\n' is the exact same thing as 10. Exactly as 0xA is. Just 3 different ways to say the exact same thing.

So, now, some may claim (as in Unix) that feeding the paper 1 line forward implies going back to the beginning of that new (and therefore so far empty) line. Some may say (as in Windows) that if you just go 1 line forward without going back to the beginning of the line (that is, if you skip the 13, aka \r), you should get
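A tiny sketch of that equivalence, assuming an ASCII-based system (which is what practically every current platform uses). The function names are made up for the example.

```c
#include <stdbool.h>

/* Assumption: ASCII-based execution character set.
   The escape, the decimal number and the hex number are the same int. */
bool same_line_feed(void)
{
    return '\n' == 10 && 10 == 0xA && '\n' == '\x0A';
}

bool same_carriage_return(void)
{
    return '\r' == 13 && 13 == 0xD && '\r' == '\x0D';
}
```

Both functions should return true on such a system, which is why comparing a character against '\r' or against 13 finds the same byte.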

hello
     world

Some may even say (as Mac people did once upon a time, before they became a variant of Unix people) that \r (going to the beginning of the line) implies feeding a line forward.

I am not very young (I am closer to retirement than to the beginning of my career), and besides, I started coding very young (at 7). So I have been coding for more than 40 years. And yet, I never knew the time when outputs were actual printers (I knew the physical green terminals, the VT100 and its kind. But even those were already some sort of printer emulators, without the physical constraint of actually performing a move and triggering actuators). So I am not sure who is really correct. I guess it depends on the printer. But I know that on the mechanical typewriter I once owned (and I think most of them were like this), feeding one line forward and going back to the beginning were done in the same gesture, though it was also possible to do each of the 2 things separately. So, I suppose they are all right. Note that Windows (and even MS-DOS) never knew that time either. But it inherited this from other, older systems, such as CP/M.

Also, I suppose that memory and disk usage were a consideration too, in favor of saying \n rather than \r\n (once upon a time, that would not have been a ridiculous economy. And Windows was never known for being economical...)

So, you see, it is not a recent debate. It is more of a "the width of US train tracks comes from the width of Roman horses' asses" story. But in the meantime, in 2022, the world is still divided between systems in which a newline is coded by a 10 (aka \n) and systems in which it is coded by a 13 then a 10 (aka \r\n).
