Implementing a custom uniq in Linux shell development

Time:11-09

I'm developing a custom shell. In this assignment, I need to implement a uniq-like command. Given sorted lines, uniq should print each unique value (and its number of occurrences if the command is uniq -c). An example is shown at the very end.

I have no problem with the algorithm. I wrote a function that does exactly the desired operation. The problem is: what form do these inputs and outputs take? When I run cat input.txt, are these lines just one string, or are they given in an array? As I said, the algorithm is fine, but I do not know how to apply it in the shell. Any help or idea is appreciated.

$cat input.txt
Cinnamon
Egg
Egg
Flour
Flour
Flour
Milk
Milk
$cat input.txt | uniq
Cinnamon
Egg
Flour
Milk

CodePudding user response:

are these lines just one string or are they given in array

These lines come from a forked process: they are just strings that cat has written to its stdout.

getline is very useful in these cases. Since you already have the algorithm, you only have to read the output of cat line by line and process it.

An example:

#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    char *str = NULL;
    size_t size = 0;
    ssize_t len = 0;
    int line = 1;

    while ((len = getline(&str, &size, stdin)) != -1)
    {
        printf("%2d) length = %2zd | %s", line++, len, str);
    }
    free(str);
    return 0;
}

gcc -o demo demo.c
cat demo.c | ./demo

Output:

 1) length = 32 | #define _POSIX_C_SOURCE 200809L
 2) length =  1 | 
 3) length = 19 | #include <stdio.h>
 4) length = 20 | #include <stdlib.h>
 5) length = 23 | #include <sys/types.h>
 6) length =  1 | 
 7) length = 15 | int main(void)
 8) length =  2 | {
 9) length = 22 |     char *str = NULL;
10) length = 21 |     size_t size = 0;
11) length = 21 |     ssize_t len = 0;
12) length = 18 |     int line = 1;
13) length =  1 | 
14) length = 54 |     while ((len = getline(&str, &size, stdin)) != -1)
15) length =  6 |     {
16) length = 61 |         printf("%2d) length = %2zd | %s", line++, len, str);
17) length =  6 |     }
18) length = 15 |     free(str);
19) length = 14 |     return 0;
20) length =  2 | }
21) length =  1 | 
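
With the reading part in place, the uniq step itself might look like the sketch below. This is only one possible way to do it, not necessarily what your assignment expects: uniq_stream and emit_group are hypothetical names, and the width of the count column is just a formatting choice for this sketch. It assumes, as in the question, that equal lines are adjacent (the input is sorted).

```c
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Print one finished group; the count comes first when show_counts is set. */
static void emit_group(FILE *out, const char *line, long count, int show_counts)
{
    if (show_counts)
        fprintf(out, "%7ld %s\n", count, line);
    else
        fprintf(out, "%s\n", line);
}

/*
 * Hypothetical helper: collapse runs of identical adjacent lines read
 * from `in` and write the result to `out`.
 * Returns 0 on success, -1 on a read or allocation error.
 */
int uniq_stream(FILE *in, FILE *out, int show_counts)
{
    char *cur = NULL;   /* buffer owned and resized by getline */
    size_t cap = 0;
    ssize_t len;
    char *prev = NULL;  /* copy of the line that started the current group */
    long count = 0;

    while ((len = getline(&cur, &cap, in)) != -1) {
        /* Drop the trailing newline so the last line compares equal
           whether or not the input ends with '\n'. */
        if (len > 0 && cur[len - 1] == '\n')
            cur[len - 1] = '\0';

        if (prev != NULL && strcmp(prev, cur) == 0) {
            count++;            /* same as the previous line: extend the group */
            continue;
        }
        if (prev != NULL)
            emit_group(out, prev, count, show_counts);
        free(prev);
        prev = strdup(cur);     /* a new group starts with this line */
        if (prev == NULL) {
            free(cur);
            return -1;
        }
        count = 1;
    }
    if (prev != NULL)
        emit_group(out, prev, count, show_counts);  /* flush the last group */
    free(prev);
    free(cur);
    return ferror(in) ? -1 : 0;
}
```

Calling uniq_stream(stdin, stdout, 0) from main and running cat input.txt | ./demo should print Cinnamon, Egg, Flour and Milk once each for the input in the question. Passing 1 as the last argument prefixes each line with its count, in the spirit of uniq -c.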

CodePudding user response:

The format is a sequence of bytes. How you store that is up to you. Your design choice.

cat input.txt opens the file input.txt and reads the bytes and sends them to the "screen" (standard output).

uniq reads bytes from the "keyboard" (standard input), does the unique stuff, and sends the output to the "screen". You can try this yourself if you want: run uniq (or uniq -c) by itself and type some lines. To make it process the last line and stop, press Enter to finish the line, then Ctrl-D.

When you do cat input.txt | uniq the shell runs cat input.txt, and it runs uniq, but it redirects cat's "screen" to uniq's "keyboard". So it's like you run cat input.txt and then whatever it displays, you type that into uniq.
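
For reference, a real shell could wire up that redirection roughly as in the sketch below, using the POSIX calls pipe(), fork(), dup2() and execvp(). The function name run_pipeline and its return convention are made up for this illustration, and error handling is kept minimal.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `left | right`. Each argument is a
 * NULL-terminated argv array. Returns the exit status of `right`,
 * or -1 if the pipeline could not be set up.
 */
int run_pipeline(char *const left[], char *const right[])
{
    int fds[2];  /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) == -1)
        return -1;

    pid_t lpid = fork();
    if (lpid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (lpid == 0) {
        /* Left child: its "screen" (stdout) becomes the pipe's write end. */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execvp(left[0], left);
        _exit(127);  /* reached only if execvp failed */
    }

    pid_t rpid = fork();
    if (rpid == -1) {
        close(fds[0]);
        close(fds[1]);
        waitpid(lpid, NULL, 0);
        return -1;
    }
    if (rpid == 0) {
        /* Right child: its "keyboard" (stdin) becomes the pipe's read end. */
        dup2(fds[0], STDIN_FILENO);
        close(fds[0]);
        close(fds[1]);
        execvp(right[0], right);
        _exit(127);
    }

    /* The parent must close both ends, or `right` never sees end-of-file. */
    close(fds[0]);
    close(fds[1]);

    int status = 0;
    waitpid(lpid, NULL, 0);
    if (waitpid(rpid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With left set to {"cat", "input.txt", NULL} and right set to {"uniq", NULL}, this would behave like cat input.txt | uniq. Both programs run at the same time, and the kernel passes the bytes between them.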

As I understand it you are writing a "pretend" shell, and yours won't actually run the two commands and connect them, so you aren't interested in how to do that, only how to simulate it.

Something to be aware of is that the bytes go straight from cat to uniq. They are not all saved into a string first. Therefore, if the first command were a slow one, uniq would be able to process the lines as soon as they were ready, without waiting for the first command to finish. With your cat command you can't tell the difference, unless the file is really big and won't fit in a string, but you may notice it with other commands.

For your pretend shell it might be simplest to process one line at a time through all the commands in order.

CodePudding user response:

when I command cat input.txt, are these lines just one string or are they given in array?

If you are executing the external cat command* then the output is written to that command's standard output. This is I/O, not shared memory. Once those data emerge from cat, it is no longer appropriate to characterize them in terms of whatever internal data structure cat used for them. They are just a sequence of characters. If another command consumes those data then it chooses its own data structures for handling them.

And how would your uniq consume those data? One of two ways:

  1. The output of cat would be redirected to a file, which uniq would afterward open and read.

    cat input.txt > temp; uniq temp
    

    OR

  2. The output of cat would be redirected to the standard input of uniq.

    cat input.txt | uniq
    

It is one of the organizing principles of UNIX that every I/O endpoint is logically a file, and therefore can be handled more or less the same. In case (1) you would open() or fopen() the named file, whereas in case (2) you would use the preconnected file descriptor 0 or stdin stream, but once you decide which of those to use, it's the same either way.
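
As a rough illustration of "the same either way", the sketch below picks the input stream once and then reads it without caring where it came from. The names open_input, close_input and count_lines are hypothetical, and treating "-" as standard input is a common convention assumed here, not something your assignment necessarily requires.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: choose the input for a uniq-like command.
 * With no file name (or "-") read standard input, otherwise open the
 * named file. Returns NULL if the file cannot be opened.
 */
FILE *open_input(const char *path)
{
    if (path == NULL || strcmp(path, "-") == 0)
        return stdin;            /* case (2): preconnected stream */
    return fopen(path, "r");     /* case (1): named file */
}

/* Close the stream unless it is stdin, which this code did not open. */
void close_input(FILE *in)
{
    if (in != NULL && in != stdin)
        fclose(in);
}

/* Count '\n'-terminated lines; the code is identical for either source. */
long count_lines(FILE *in)
{
    long lines = 0;
    int c;

    while ((c = fgetc(in)) != EOF)
        if (c == '\n')
            lines++;
    return lines;
}
```

In main this could be used as FILE *in = open_input(argc > 1 ? argv[1] : NULL); and everything after that line stays the same for uniq temp and for cat input.txt | uniq.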


*And if you are executing your own internal cat then you know the details better than we do.
