In a text file, I want to remove duplicates that span two lines: wherever four consecutive lines consist of the same two lines repeated, I want to keep only the first (or last) two. The order of the remaining lines in the file must be preserved.
Example
Consider a file input.txt where foo\nbar is repeated and baz\nboo is repeated, each in consecutive two-line blocks:
1
foo
bar
foo
bar
2
3
baz
boo
baz
boo
4
Desired contents:
1
foo
bar
2
3
baz
boo
4
Things tried / considered: uniq, sed
The same task is fairly simple for single-line duplicates: uniq input.txt.
I also had a look at sed, but couldn't get it to work.
CodePudding user response:
If you are open to a Perl solution:
perl -0777 -pe 's/(.+\R.+\R)\1/$1/g' file
1
foo
bar
2
3
baz
boo
4
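To make the regex idea in the Perl one-liner concrete (capture two non-empty lines, then require the same text again via a backreference), here is a minimal sketch of the same approach in Python. It is an illustration, not part of the original answer: the helper name is hypothetical, it assumes plain \n line endings rather than the broader \R, and it anchors the match at the start of a line, which the one-liner does not do.

```python
import re

# Two non-empty lines captured as group 1, immediately followed by the same text.
# The ^ anchor with re.MULTILINE is an added assumption so that a match
# can only begin at the start of a line.
PAIR_REPEAT = re.compile(r"^((?:.+\n){2})\1", re.MULTILINE)


def collapse_repeated_pairs(text: str) -> str:
    """Replace each doubled two-line block with a single copy (hypothetical helper)."""
    return PAIR_REPEAT.sub(r"\1", text)


# The sample input from the question, read as one string (like perl -0777).
sample = "1\nfoo\nbar\nfoo\nbar\n2\n3\nbaz\nboo\nbaz\nboo\n4\n"
result = collapse_repeated_pairs(sample)
```

As with -0777 in Perl, this reads the whole file into memory, so it suits files of moderate size.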
CodePudding user response:
With GNU sed for -E (EREs) and -z (read the whole file into memory):
$ sed -Ez 's/((.*\n){2})\1/\1/g' file
1
foo
bar
2
3
baz
boo
4
I also think you need GNU sed for the backreference in the regexp, as I don't think backreferences in EREs are part of POSIX, but I'm not 100% sure on that one.
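If the backreference is the sticking point, the same result can be reached without any regex by comparing adjacent pairs of lines directly. The following is a minimal sketch in Python rather than sed, with a hypothetical function name, and it assumes the file fits in memory as a list of lines. It keeps the first copy of each repeated pair and preserves line order.

```python
def drop_repeated_pairs(lines):
    """Return lines with each immediately repeated two-line block kept once."""
    out = []
    i = 0
    while i < len(lines):
        pair = lines[i:i + 2]
        # A repeat means the next two lines equal the current two lines.
        if len(pair) == 2 and lines[i + 2:i + 4] == pair:
            out.extend(pair)  # keep the first copy
            i += 4            # skip past both copies
        else:
            out.append(lines[i])
            i += 1
    return out


# The sample input from the question, one list item per line.
sample_lines = ["1", "foo", "bar", "foo", "bar", "2",
                "3", "baz", "boo", "baz", "boo", "4"]
deduped = drop_repeated_pairs(sample_lines)
```

To apply it to a file, one could read the lines with open("input.txt").read().splitlines() and print the returned list joined by newlines.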