I have a large file looking like this:
esup_255_3 transdecoder 7655 8192
esup_6093_1 transdecoder 2732 2774
esup_25727_1 transdecoder 1 60
...
with columns 3 and 4 representing intervals of numbers.
I am trying to modify this file so that every number within each interval is listed in an additional column (here column 5), as follows:
esup_255_3 transdecoder 7655 8192 7655
esup_255_3 transdecoder 7655 8192 7656
esup_255_3 transdecoder 7655 8192 7657
esup_255_3 transdecoder 7655 8192 ...
esup_255_3 transdecoder 7655 8192 8192
esup_6093_1 transdecoder 2732 2774 2732
esup_6093_1 transdecoder 2732 2774 2733
esup_6093_1 transdecoder 2732 2774 ....
esup_6093_1 transdecoder 2732 2774 2774
... and so on...
I think Perl may be helpful with this, but I am very new to it. I am only proficient in bash, and here I cannot seem to find the right way to obtain what I need.
CodePudding user response:
Something like this?
perl -lne 'my ($line, $from, $to) = /^(.*\s(\d+)\s+(\d+)\s*)$/; print "$line\t$_" for $from .. $to;'
When I run it on your snippet it prints out 641 lines:
esup_255_3 transdecoder 7655 8192 7655
esup_255_3 transdecoder 7655 8192 7656
esup_255_3 transdecoder 7655 8192 7657
[...]
esup_255_3 transdecoder 7655 8192 8190
esup_255_3 transdecoder 7655 8192 8191
esup_255_3 transdecoder 7655 8192 8192
esup_6093_1 transdecoder 2732 2774 2732
esup_6093_1 transdecoder 2732 2774 2733
[...]
esup_6093_1 transdecoder 2732 2774 2773
esup_6093_1 transdecoder 2732 2774 2774
esup_25727_1 transdecoder 1 60 1
esup_25727_1 transdecoder 1 60 2
[...]
esup_25727_1 transdecoder 1 60 59
esup_25727_1 transdecoder 1 60 60
An explanation follows. Let's start with the options:
perl -lne
We'll take them right to left. The -e (for "execute" or "evaluate") just tells Perl that the next thing on the command line is the code to run, so it won't treat that argument as the name of a script file.
The -n tells it to automatically iterate over its input line-by-line; it acts as though there's a while (<>) { ... } loop wrapped around the actual code. Inside the body of the loop the current line will be found in the topic variable $_.
The -l tells it to strip the newlines off the input and automatically append one to each string printed out; this basically takes newlines out of the picture and simplifies the logic.
So the program will read the input line-by-line and run the code that is given as the argument to -e on each line. Let's look at that code, which starts with this statement:
my ($line, $from, $to) = /^(.*\s(\d+)\s+(\d+)\s*)$/;
The regular expression doesn't have an explicit string to match against, so it automatically matches against $_, which has the current line. It must match the whole line (because of the ^ at the beginning and $ at the end). The actual line value is also captured because of the outermost parentheses, so it will be the first item returned by the match, which is assigned to the variable $line.
The first part of the line can be anything at all (since .* matches everything), so we're really looking at the way the string ends instead of the way it starts. The first item of interest is a single whitespace character (\s), which is there so that the greedy .* can't swallow the leading digits of the number that follows. Next we're looking for one or more digits (\d+), which the parentheses capture, so that value will also be returned by the match; it's the second capture, so it goes into the second variable in the assignment, $from. After those digits we look for more whitespace (\s+: at least one whitespace character is required but any number is allowed) followed by another sequence of digits (\d+); this second set of digits is again captured and returned, so it winds up in the last variable, $to. Finally, \s* allows the last set of digits to be followed by any amount of optional trailing space.
So after reading your first line, $_ = "esup_255_3 transdecoder 7655 8192 ", the match assignment will set $line to a copy of that whole string, $from to 7655, and $to to 8192.
Then we come to the output. This line:
print "$line\t$_" for $from .. $to;
is a shorter way of writing this loop:
foreach $_ ($from .. $to) {
    print "$line\t$_";
}
Which means it loops over the whole numbers from $from to $to, reusing $_ as the loop control variable (which is why we had to copy the current line into $line). For each value in the range, it prints out a copy of the whole line, followed by a tab and the current number.