grep/perl regex for finding a header and a matching line

Time:06-25

Let's say I have a file called courses.txt with contents like those below. The file has sections, each starting with a course provider and the email I used there, followed by that provider's courses. For example: edX ([email protected]), then the course names, each preceded by a serial number.

udemy ([email protected])  
"=========================="-  
1) foo bar
2) java programming language
3) redis stephen grider
4) javascript
5) react with typescript
6) kotlin
7) Etherium and Solidity : the Complete Developer's Guide
8) reactive programming with spring  


coursera ([email protected])  
"==========================-"  
1) python
2) typescript
3) java concurrency
4) C#

edX ([email protected])  
"==========================-"  
1) excel
2) scala
3) risk management
4) stock
5) oracle
6) mysql  
7) java  
==========================-    

Question: I want to grep for a course, say "java". I want output that shows the line(s) that match (for example, "java") together with the corresponding section name (say, "edX ([email protected])").

If I search for "java", what regex will give me the following matches? (I use grep/perl on Windows.)

udemy ([email protected])    
2) java programming language  

coursera ([email protected])  
3) java concurrency

edX ([email protected])    
7) java    

I tried lookbehind/lookahead but couldn't figure out how to print the course provider name with email and the course name.

Thoughts?

CodePudding user response:

I won't give you a complete solution, but you can start with this:

grep -iE "java|@" filename.txt

Some explanation:

  • the -i makes it case insensitive
  • the -E uses extended regular expressions
  • the | is an example of those extended regular expressions and it means "OR": show the lines which contain 'java' OR '@' (the latter being all the lines with email addresses)

As a result, you get output with all the lines containing an email address (the section headers) and all the 'java' courses, with one catch: if a line with an email address is directly followed by another line with an email address (or by nothing), then there's no 'java' course for that address. Hence, you can now use Perl to remove those email lines. Note also that 'java' matches 'javascript' as well, so you may want to match it as a whole word.

CodePudding user response:

Looking at the input data, we can conclude that each section starts with a line that includes an email address.

The data lines of a section each start with a serial number.

Based on this, we can build a hash %sections that uses the line containing the email as the key and stores all lines starting with a serial number in an array under that key.

Once the hash is built, the code goes through all sections and looks for lines that contain the search term; for each match it prints the section header with the matching line.

Note: to work on a real file, replace <DATA> with <> and pass the search term first, then the file name: ./script.pl java filename.dat

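#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

my (%sections, $key);

# The search term is the first command-line argument.
my $lookfor = shift or die "Usage: $0 search-term\n";

# Group the numbered course lines under their section header.
while ( <DATA> ) {
    chomp;
    s/\s+$//;                    # drop trailing spaces left in the data
    $key = $_ if /@/;            # a header line contains an email address
    # \d+ so that serial numbers with more than one digit are kept too
    push @{ $sections{$key} }, $_ if defined $key && /^\d+\) /;
}

# Print every matching line under its header. Hash keys come back in
# no particular order, so the sections may print in any order.
for my $section ( keys %sections ) {
    for ( @{ $sections{$section} } ) {
        say "$section\n"
          . '-' x 30
          . "\n$_\n" if /\b$lookfor\b/i;
    }
}

exit 0;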

__DATA__
udemy ([email protected])  
"=========================="-  
1) foo bar
2) java programming language
3) redis stephen grider
4) javascript
5) react with typescript
6) kotlin
7) Etherium and Solidity : the Complete Developer's Guide
8) reactive programming with spring  


coursera ([email protected])  
"==========================-"  
1) python
2) typescript
3) java concurrency
4) C#

edX ([email protected])  
"==========================-"  
1) excel
2) scala
3) risk management
4) stock
5) oracle
6) mysql  
7) java  
==========================-    
<br>

Output

edX ([email protected])
------------------------------
7) java

coursera ([email protected])
------------------------------
3) java concurrency

udemy ([email protected])
------------------------------
2) java programming language
