How to remove duplicate lines in a file?

I understand that the general approach is to use something like

$ sort file1.txt | uniq > file2.txt

But I was wondering if there was a way to do this without needing separate source and destination files, even if it means it can't be a one-liner.

CodePudding user response:

With GNU awk for "inplace" editing:

awk -i inplace '!seen[$0]++' file1.txt

As with all tools that support "inplace" editing (sed -i, perl -i, ruby -i, etc.), this uses a temp file behind the scenes; the exception is ed, which instead requires the whole file to be read into memory first.
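
You can watch the temp-file mechanism at work by checking the file's inode before and after: gawk writes a temp file and renames it over the original, so the inode changes (the inode numbers below are illustrative):

$ ls -i file1.txt
5638296 file1.txt
$ awk -i inplace '!seen[$0]++' file1.txt
$ ls -i file1.txt
5638307 file1.txt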

With any awk you can do the following; it uses no temp file, but about twice the memory instead:

awk '!seen[$0]++{a[++n]=$0} END{for (i=1;i<=n;i++) print a[i] > FILENAME}' file
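
For example, with a small test file (the order of first occurrences is preserved):

$ printf 'b\na\nb\n' > file
$ awk '!seen[$0]++{a[++n]=$0} END{for (i=1;i<=n;i++) print a[i] > FILENAME}' file
$ cat file
b
a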

CodePudding user response:

With Perl's -i:

perl -i -lne 'print unless $seen{$_}++' original.file
  • -i changes the file "in place";
  • -n reads the input line by line, running the code for each line;
  • -l removes newlines from input and adds them to print;
  • The %seen hash idiom is described in perlfaq4.
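
If you want a safety net, -i also accepts a backup suffix (standard Perl behavior; the .bak suffix here is just an example), so the pre-edit contents are kept in original.file.bak:

perl -i.bak -lne 'print unless $seen{$_}++' original.file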

CodePudding user response:

A common idiom is:

temp=$(mktemp)
some_pipeline < original.file > "$temp" && mv "$temp" original.file

The && is important: if the pipeline fails, then the original file won't be overwritten with (perhaps) garbage.
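
For this question, the pipeline could be the order-preserving awk filter from the first answer:

temp=$(mktemp)
awk '!seen[$0]++' original.file > "$temp" && mv "$temp" original.file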

The Linux moreutils package contains a program, sponge, that encapsulates this pattern: it soaks up all of its input before writing to the output file:

some_pipeline < original.file | sponge original.file
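
Again using the awk filter from the first answer as the pipeline:

awk '!seen[$0]++' original.file | sponge original.file

Note that, unlike the && idiom above, sponge has no way of knowing whether an earlier command in the pipeline failed.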

CodePudding user response:

Simply use the -o and -u options of sort:

sort -o file -u file

You don't even need to pipe into another command such as uniq; POSIX explicitly allows the -o output file to be the same as one of the input files.
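
A quick demonstration (note that, like the sort | uniq approach in the question, this sorts the lines rather than keeping their original order):

$ printf 'b\na\nb\n' > file
$ sort -o file -u file
$ cat file
a
b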

CodePudding user response:

Using sed

$ sed -i -n 'G;/^\(.*\n\).*\n\1$/d;H;P' input_file
  • G - Append the hold space to the pattern space.

  • /^\(.*\n\).*\n\1$/d - Using a back-reference, delete the pattern space when the current line already occurs in the hold space.

  • H - Append the pattern space to the hold space.

  • P - Print the pattern space up to the first newline.

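
This also preserves the order of first occurrences (-i here is the GNU sed flag):

$ printf 'b\na\nb\n' > input_file
$ sed -i -n 'G;/^\(.*\n\).*\n\1$/d;H;P' input_file
$ cat input_file
b
a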