I have a very large text file with 2 columns and more than 10 million lines. In most lines, the number in column 2 is the number in column 2 of the previous line + 1. However, a few thousand lines behave differently (see the example below).
Input file:
A 1
A 2
A 3
A 10
A 11
A 12
A 40
A 41
I would like to extract each pair of consecutive lines that does not respect the increment of 1 in column 2.
Desired output file:
A 3
A 10
A 12
A 40
Is there (preferably) an awk command that can do this? I tried several scripts that compare column 2 of two consecutive lines, but so far without success (see the code below).
awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt
Thanks for your help. Best,
CodePudding user response:
Assumptions:
- lines of interest must have the same value in COLUMN 1 (i.e., if the values in COLUMN 1 differ then we don't bother with comparing the values in COLUMN 2 and instead move on to the next input line)
- if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once
- always print the 2x header lines regardless of whether or not we find any lines of interest (OP's description makes it sound like there will always be at least one set of rows that match the search requirement)
- each line contains 3x literal | characters, with a space separating each | from the contents of COLUMN 1/2
Setup:
$ cat input.txt
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 1 |
| A | 2 |
| A | 3 | # match
| A | 10 | # match
| A | 11 |
| A | 12 | # match
| A | 23 | # match
| A | 40 | # match
| A | 41 |
| X to Z | 101 |
| X to Z | 102 | # match
| X to Z | 104 | # match
| X to Z | 105 |
NOTE: comments only added here to highlight the lines that match the search criteria
One awk idea:
awk -F'|' '
FNR<=2 { print; next }
FNR==3 { prev2=$2; prev3=$3-1 }
       { if ($2 == prev2 && $3+0 != prev3+1) {
            if (prevline) print prevline
            print
            prevline=""        # make sure this line is not printed again if next line also meets criteria
         }
         else
            prevline=$0
         prev2=$2
         prev3=$3
       }
' input.txt
This generates:
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 3 |
| A | 10 |
| A | 12 |
| A | 23 |
| A | 40 |
| X to Z | 102 |
| X to Z | 104 |
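A note on the field handling: with -F'|', awk keeps the spaces around each value, so the third field of a data row is the number padded with blanks rather than the bare number. Adding 0 makes the numeric intent of the comparison explicit. A minimal sketch to see this on one row from the setup (the bracketed output format is only for illustration and is not part of the answer):

```shell
# Split one sample row on "|" and show fields 2 and 3 as awk sees them,
# followed by field 3 after forcing it to a number with "+ 0".
printf '%s\n' '| A | 10 |' |
    awk -F'|' '{ printf "[%s] [%s] [%d]\n", $2, $3, $3 + 0 }'
# prints: [ A ] [ 10 ] [10]
```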
CodePudding user response:
I like Perl for text processing that needs arithmetic.
$ perl -ane 'print and next if $.<3; print $p and print if $F[3]!=$fp+1; $fp=$F[3]; $p=$_' input.txt
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 3 |
| A | 10 |
| A | 12 |
| A | 40 |
- This is using -a to autosplit into @F.
- Prints first 2 lines: print and next if $.<3
- On subsequent lines, prints previous line and current line if the 4th field isn't exactly one more than the prior 4th field: print $p and print if $F[3]!=$fp+1
- Saves the 4th field as $fp and the entire line as $p: $fp=$F[3]; $p=$_
CodePudding user response:
Would you please try the following:
awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt
Output:
A 3
A 10
A 12
A 40
- The variable names are similar to yours: p holds the previous line and p2 holds the second column of the previous line.
- The condition NR>1 suppresses printing on the 1st line.
- if ($2!=p2+1) print p ORS $0 prints the pairs of lines that meet the condition.
- The block {p=$0; p2=$2} preserves the values of the current line for the next iteration.
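For readability, the same idea can also be written as a commented multi-line script. This is only a sketch under two assumptions that are not part of the answer above: a line that both closes one gap and opens the next is printed once (as in the first answer), and the wrapper name find_gaps is hypothetical.

```shell
# find_gaps (hypothetical name) reads "name number" lines from the given
# file, or from stdin when no file is given, and prints each pair of
# consecutive lines whose column 2 values do not differ by exactly 1.
find_gaps() {
    awk '
        NR > 1 && $2 != p2 + 1 {     # gap between previous and current line
            if (!printed) print p    # previous line has not been printed yet
            print                    # current line
            printed = 1              # remember that the current line is out
            p = $0; p2 = $2
            next
        }
        { p = $0; p2 = $2; printed = 0 }   # no gap: just remember this line
    ' "$@"
}

# Demo on the sample input from the question.
printf '%s\n' 'A 1' 'A 2' 'A 3' 'A 10' 'A 11' 'A 12' 'A 40' 'A 41' | find_gaps
```

On the sample input this prints the four lines of the desired output. For the real file it would be called as find_gaps input.txt > output.txt.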