I have some data that looks like this:
chr1 3861154 N 20
chr1 3861155 N 20
chr1 3861156 N 20
chr1 3949989 N 22
chr1 3949990 N 22
chr1 3949991 N 22
What I need to do is assign a code based on column 2. If the value equals the value on the previous line plus one, then both lines belong to the same series and I need to give them the same code in a new column. That code could be the column 2 value of the first line of the series. The desired output for this example would be:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
I was thinking of using awk, but of course that's not a requirement. Any ideas on how I could make this work?
Edit to add the code I'm working on:
awk 'BEGIN {var = $2} {if ($2 == var + 1) print $0"\t"var; else print $0"\t"$2; var = $2 }' test
I think the idea is there, but it's not quite right yet. The result I'm getting is:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861155
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949990
Thanks!
Answer:
$ cat tst.awk
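# Start a new series on the first line or when column 2
# is not exactly the previous value plus one.
(NR == 1) || ($2 != (prev + 1)) {
    val = $2
}
{
    # Append the series code, then remember this line's column 2.
    print $0, val
    prev = $2
}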
$ awk -f tst.awk file
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
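To try this without creating tst.awk, the same logic fits in a one-liner. This is only a sketch: the inlined sample data is made up from the question's rows, the variable names start, prev and chr are arbitrary, and the extra $1 != chr test is an assumption that is not part of the script above (it keeps a series from continuing across a chromosome change).

```shell
out=$(printf '%s\n' \
  'chr1 3861154 N 20' \
  'chr1 3861155 N 20' \
  'chr1 3949989 N 22' \
  'chr2 3949990 N 22' |
awk '
  # Start a new series on the first line, on a chromosome change,
  # or when column 2 is not exactly the previous value plus one.
  NR == 1 || $1 != chr || $2 != prev + 1 { start = $2 }
  # Append the series code, then remember this line for the next comparison.
  { print $0, start; chr = $1; prev = $2 }
')
printf '%s\n' "$out"
```

If the input is tab-separated and the output should be too, adding BEGIN { FS = OFS = "\t" } should be enough.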
The big mistake in your script was this part:
BEGIN {var = $2}
because $2 is the 2nd field of the current line of input, but BEGIN is executed before any input lines have been read. So the value of $2 in the BEGIN section is zero-or-null, just like any other unset variable.
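A quick way to see this for yourself is to print $2 from both the BEGIN section and the main block. This is a minimal sketch; the one-line sample input and the bracket markers are only there for illustration.

```shell
# In BEGIN no record has been read yet, so $2 is empty;
# in the main block it holds the 2nd field of the current line.
out=$(printf 'chr1 3861154 N 20\n' |
  awk 'BEGIN { print "BEGIN: [" $2 "]" } { print "main: [" $2 "]" }')
printf '%s\n' "$out"
```

The BEGIN line prints empty brackets, while the main block prints the field value.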