Why think in terms of buffers and not lines? And can fgets() read multiple lines at one time?


Recently I have been reading Unix Network Programming, Vol. 1. In Section 3.9, the last two paragraphs above Figure 3.18 say (here I quote):

...But our advice is to think in terms of buffers and not lines. Write your code to read buffers of data, and if a line is expected, check the buffer to see if it contains that line.

And in the next paragraph, the authors gave a more specific example, here I quote:

...as we'll see in Section 6.3. System functions like select still won't know about readline's internal buffer, so a carelessly written program could easily find itself waiting in select for data already received and stored in readline's buffers.

In Section 6.5, the actual problem is the mixing of stdio and select(), which, to quote the book, makes the program "error-prone". But how?

I know that the authors give the answer later in the same section, and according to my understanding of the book, it is because the data is hidden from select(), so select() cannot know whether the data that has already been read has been consumed or not.

The answer is literally there, but the first problem is that I have a hard time getting it: I cannot imagine what damage it would do to the program. Maybe I need a demo program that suffers from the problem to help me understand it.

Still in Section 6.5, the authors try to explain the problem further; here I quote:

... Consider the case when several lines of input are available from the standard input. select will cause the code at line 20 to read the input using fgets and that, in turn, will read the available lines into a buffer used by stdio. But, fgets only returns a single line and leaves any remaining data sitting in the stdio buffer ...

The "line 20" mentioned above is:

if (Fgets(sendline, MAXLINE, fp) == NULL)

where sendline is an array of char and fp is a pointer to FILE. I looked into the detailed implementation of Fgets, and it just wraps fgets() with some extra error-handling logic and nothing more.

And here comes my second question: how does fgets manage to, here I quote again, "read the available lines"? I looked at the man page of fgets, and it says fgets normally stops at the first newline character. Doesn't this mean that only one line is read by fgets? More specifically, if I type one line in the terminal and press the Enter key, fgets reads that exact line. If I do it again, the next line is read by fgets; the point is, one line at a time.

Thanks for your patience in reading all of this; I'm looking forward to your answers.

CodePudding user response:

One of the main reasons to think in terms of buffers rather than lines (when it comes to network programming) is that TCP is a streaming protocol, in which data is just a stream of bytes beginning with a connection and ending with a disconnection.

There are no message boundaries, and there are no "lines", except what the application-level protocol on top of TCP has decided.

That makes it impossible to read a "line" from a TCP connection; there are no primitive functions for it. You must read using buffers. And because of the streaming nature and the lack of any kind of boundaries, a single call to receive data may give your application less than you asked for, and what you get may be a partial application-level message. Or you might get more than a single message, including a partial message at the end.
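
To make the "less than you asked for" point concrete, here is a minimal sketch (my own, not from the book) of reading a fixed-length message from a connected, blocking socket; the function name recv_exactly is just an illustration:

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Read exactly `len` bytes from a connected, blocking socket.
     * A single recv() may return fewer bytes than requested, so loop. */
    ssize_t recv_exactly(int fd, void *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(fd, (char *)buf + got, len - got, 0);
            if (n == 0)        /* peer closed the connection mid-message */
                break;
            if (n < 0)         /* error; caller should inspect errno */
                return -1;
            got += (size_t)n;  /* partial read: keep going */
        }
        return (ssize_t)got;
    }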

Another important note is that sockets are blocking by default, so a socket that doesn't have any data ready to be received will cause any read call to block and wait until there is data. The select call only tells you that a read call won't block right now. If you make read calls multiple times in a loop, it can (and ultimately will) block once the data to receive is exhausted.

All this makes it really hard to use high-level functions like fgets (after an fdopen call, of course) to read data from TCP sockets, as it can block at any time if you use blocking sockets. Or it can return with a failure if you use non-blocking sockets and the underlying read call fails because it would block (yes, that is reported as an error).

If you do your own buffering, you can use select in the same loop as read or recv to make sure that the call won't block. Or, if you use non-blocking sockets, you can gather data (appending to your buffer) with single read calls, and add detection for when you have a full message (either by knowing its length or by detecting the message terminator or separator, like a newline).
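
As an illustration of the non-blocking alternative, here is a rough sketch (my own, with a hypothetical helper name gather) of switching a socket to non-blocking mode and treating the would-block error as "nothing to read yet" rather than a failure:

    #include <errno.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Put a socket (or any file descriptor) into non-blocking mode. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Append whatever is currently available to buf[used..cap).
     * Returns the number of bytes added, 0 if the call would have blocked,
     * or -1 on a real error or when the peer has closed the connection. */
    ssize_t gather(int fd, char *buf, size_t used, size_t cap)
    {
        if (used >= cap)
            return 0;                       /* buffer already full */
        ssize_t n = recv(fd, buf + used, cap - used, 0);
        if (n > 0)
            return n;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;                       /* would block: just no data yet */
        return -1;                          /* real error, or peer closed (n == 0) */
    }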


As for fgets reading "multiple lines": it can cause the underlying read calls to fill stdio's internal buffer with multiple lines, but fgets itself will only fill your supplied buffer with a single line.

fgets will never give you multiple lines.

CodePudding user response:

select is a Linux system call. It will tell you whether the Linux kernel has data that your process hasn't received yet.

fgets is a C library call. To reduce the number of Linux kernel calls (which are relatively slow), the C library uses buffering. It tries to read a big chunk of data from the kernel (typically something like 4096 bytes) and then returns just the part you asked for. The next time you call it, it checks whether it has already read the part you asked for, and if so it doesn't need to read from the kernel at all. For example, if it is able to read 5 lines at once from the kernel, it will return the first line, and the other 4 will be stored inside the C library and returned by the next 4 calls.

When fgets reads 5 lines, returns 1, and stores 4, the Linux kernel sees that all the data has been read; it doesn't know your program is using the C library's buffer. Therefore select will say there is no data to read, and your program will get stuck waiting for the next line, even though one is already sitting inside your process.
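
Here is a minimal demo (my own sketch, not the book's code) that reproduces exactly this stuck state. Run it with several lines arriving at once while the pipe stays open, for example (printf 'one\ntwo\nthree\n'; sleep 60) | ./demo: the first fgets pulls all three lines into the stdio buffer and returns only "one", and the next select then blocks even though "two" and "three" are already inside the process.

    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void)
    {
        char line[1024];

        for (;;) {
            fd_set rset;
            FD_ZERO(&rset);
            FD_SET(STDIN_FILENO, &rset);

            /* select() only sees data still held by the kernel; it knows
             * nothing about lines already sitting in stdio's buffer. */
            if (select(STDIN_FILENO + 1, &rset, NULL, NULL, NULL) < 0)
                return 1;

            if (FD_ISSET(STDIN_FILENO, &rset)) {
                if (fgets(line, sizeof(line), stdin) == NULL)
                    return 0;              /* EOF or error */
                printf("got: %s", line);   /* only one line per wakeup */
            }
        }
    }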


So how do you resolve this? You basically have two options: don't do buffering at all, or do your own buffering so you get to control how it works.

Option 1 means you read 1 byte at a time until you get a \n, and then you stop reading. The kernel knows exactly how much data you have read, and it can accurately tell you whether there is more. However, making a kernel call for every single byte is relatively slow (measure it), and the computer on the other end of the connection could cause your program to freeze simply by never sending a \n at all.
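
A sketch of option 1 (an illustration only; read_line_bytewise is a made-up name), assuming fd is a socket that select has just reported readable:

    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Read one line, one byte at a time, straight from the kernel.
     * Nothing is ever hidden in a user-space buffer, but it costs
     * one system call per byte. Returns the number of bytes stored. */
    ssize_t read_line_bytewise(int fd, char *buf, size_t maxlen)
    {
        size_t i = 0;

        if (maxlen == 0)
            return 0;
        while (i + 1 < maxlen) {
            char c;
            ssize_t n = read(fd, &c, 1);   /* one kernel call per byte */
            if (n < 0)
                return -1;                 /* error */
            if (n == 0)
                break;                     /* EOF / peer closed */
            buf[i++] = c;
            if (c == '\n')
                break;                     /* got a complete line */
        }
        buf[i] = '\0';
        return (ssize_t)i;
    }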

I want to point out that option 1 is completely viable if you are just making a prototype. It does work, it's just not very good. If you try to fix the problems with option 1, you will find the only way to fix them is to do option 2.

Option 2 means doing your own buffering. You keep an array of, say, 4096 bytes per connection. Whenever select says there is data, you try to fill up the array as much as possible and check whether there is a \n in it. If so, you process that line, remove it from the array*, and repeat. This minimizes kernel calls, and you also won't freeze if the other computer never sends a \n, since the unfinished line just stays in the array. If all 4096 bytes are used and there is still no \n, you can either process it as one big line (if that makes sense, e.g. in a chat program) or close the connection, since the other computer is breaking the rules. Of course you can choose a bigger number than 4096. A rough sketch is shown below.

* Extra for experts: "removing the line from the array" can be fast if you implement a "circular buffer" data structure.
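
And here is a rough sketch of option 2 as described above (again just an illustration; process_line and drop_client are placeholder names). It uses a plain memmove to "remove the line from the array"; a circular buffer would avoid that copy:

    #include <stddef.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define CONNBUF 4096

    struct conn {
        int    fd;
        char   buf[CONNBUF];
        size_t used;
    };

    void process_line(struct conn *c, const char *line, size_t len);  /* placeholder */
    void drop_client(struct conn *c);                                 /* placeholder */

    /* Call this when select() reports c->fd readable. */
    void on_readable(struct conn *c)
    {
        ssize_t n = recv(c->fd, c->buf + c->used, CONNBUF - c->used, 0);
        if (n <= 0) {                      /* error or peer closed */
            drop_client(c);
            return;
        }
        c->used += (size_t)n;

        char *nl;
        while ((nl = memchr(c->buf, '\n', c->used)) != NULL) {
            size_t linelen = (size_t)(nl - c->buf) + 1;
            process_line(c, c->buf, linelen);
            memmove(c->buf, c->buf + linelen, c->used - linelen);  /* remove the line */
            c->used -= linelen;
        }

        if (c->used == CONNBUF)            /* full buffer, still no '\n' */
            drop_client(c);                /* or treat it as one big line */
    }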
