How to convert a list of intervals into the list of numbers comprised within these intervals?

Time:05-18

I have a large file looking like this:

esup_255_3      transdecoder   7655    8192         
esup_6093_1     transdecoder   2732    2774        
esup_25727_1    transdecoder   1       60 
...  

with columns 3 and 4 representing intervals of numbers.

I am trying to modify this file to have the list of numbers comprised within the intervals, listed in a different column (here in column 5) as follows:

esup_255_3      transdecoder    7655    8192    7655     
esup_255_3      transdecoder    7655    8192    7656
esup_255_3      transdecoder    7655    8192    7657 
esup_255_3      transdecoder    7655    8192     ...    
esup_255_3      transdecoder    7655    8192    8192    
esup_6093_1     transdecoder    2732    2774    2732     
esup_6093_1     transdecoder    2732    2774    2733     
esup_6093_1     transdecoder    2732    2774    ....     
esup_6093_1     transdecoder    2732    2774    2774     
... and so on...

I think Perl may be helpful with this, but I am very new to it. I am only proficient in bash, and here I cannot seem to find the right way to obtain what I need.

CodePudding user response:

Something like this?

    perl -lne 'my ($line, $from, $to) = /^(.*\s(\d+)\s+(\d+)\s*)$/; print "$line\t$_" for $from..$to;' 

When I run it on your snippet it prints out 641 lines:

esup_255_3      transdecoder   7655    8192             7655
esup_255_3      transdecoder   7655    8192             7656
esup_255_3      transdecoder   7655    8192             7657
[...]
esup_255_3      transdecoder   7655    8192             8190
esup_255_3      transdecoder   7655    8192             8191
esup_255_3      transdecoder   7655    8192             8192
esup_6093_1     transdecoder   2732    2774         2732
esup_6093_1     transdecoder   2732    2774         2733
[...]
esup_6093_1     transdecoder   2732    2774         2773
esup_6093_1     transdecoder   2732    2774         2774
esup_25727_1    transdecoder   1       60   1
esup_25727_1    transdecoder   1       60   2
[...]
esup_25727_1    transdecoder   1       60   59
esup_25727_1    transdecoder   1       60   60

An explanation follows. Let's start with the options:

perl -lne

We'll take them right to left. The -e (for "execute" or "evaluate") just tells Perl that the next thing on the command line is the code to run, so it won't be looking for code on standard input.

The -n tells it to automatically iterate over its input line-by-line; it acts as though there's a while (<>) {...} loop wrapped around the actual code. Inside the body of the loop the current line will be found in the topic variable $_.

The -l tells it to strip the newlines off the input and automatically append one to each string printed out; this basically takes newlines out of the picture and simplifies the logic.

So the program will read the input line-by-line and run the code that is given as the argument to -e on each line. Let's look at that code, which starts with this statement:

my ($line, $from, $to) = /^(.*\s(\d+)\s+(\d+)\s*)$/;

The regular expression doesn't have an explicit string to match against, so it automatically matches against $_, which has the current line. It must match the whole line (because of the ^ at the beginning and $ at the end). The actual line value is also captured because of the outermost parentheses, so it will be the first item returned by the match, which is assigned to the variable $line.

The first part of the line can be anything at all (since .* matches any run of characters), so we're really looking at the way the string ends instead of the way it starts. The first item of interest is a single whitespace character (\s); it forces the first number to begin right after whitespace, so the greedy .* can't swallow any of its leading digits. Next we look for one or more digits (\d+), which the parentheses capture, so that value will also be returned by the match; it's the second capture, so it goes into the second variable in the assignment, $from. After those digits we look for more whitespace (\s+: at least one whitespace character is required but any number is allowed) followed by another sequence of digits (\d+); this second set of digits is again captured and returned, so it winds up in the last variable, $to. Finally we allow the last set of digits to be followed by any amount of optional trailing whitespace (\s*).

So after reading your first line, $_ = "esup_255_3 transdecoder 7655 8192 ", the match assignment will set $line to a copy of that whole string, $from to 7655, and $to to 8192.

Then we come to the output. This line:

print "$line\t$_" for $from .. $to;

Is a shorter way of writing this loop:

foreach $_ ($from .. $to) {
   print "$line\t$_";
}

Which means it loops over the whole numbers from $from to $to, reusing $_ as the loop control variable (which is why we had to copy the current line into $line). For each value in the range, it prints out a copy of the whole line, followed by a tab and the current number.
