How to grep a specific line and its subsequent lines with a specific pattern?

Time:07-31

I have data in the following format.

>ab:xy_a0by98-2 \Movie= top gun \actor= Tom \Genere=Action \Length=234 \Credits=30 \pe=1 \summry=(Tom|action|234)
Top Gun is a 1986 American action drama film directed by Tony Scott, and produced by Don Simpson and Jerry Bruckheimer

>ab:xy_b0ha81-5 \Movie= Thor \actor= chris hemsworth \Genere=Action \Length=321 \Credits=20 \pe=0 \summry=(chris|Action|321)
Thor embarks on a journey unlike anything he's ever faced a quest for inner peace

>ab:xy_c0ma65-1 \Movie= Batman \actor= Bale \Genere=Action \Length=251 \Credits=30 \pe=1 \summry=(Bale|Action|251)
From American Psycho to Batman Begins to Vice, Christian Bale is a bonafide A-list star
But he missed out on plenty of huge roles along the way.

>ab:xy_d0fc78-2 \Movie= Joker \actor= Phoenix \Genere=thriller \Length=341 \Credits=35 \pe=2 \summry=(phoenix|thriller|341)
Joker is a 2019 American psychological thriller film directed and produced by Todd Phillips
who co-wrote the screenplay with Scott Silver

>ab:xy_e0ra81-2 \Movie= Superman \actor= henry cavill \Genere=Action \Length=254 \Credits=28 \pe=1 \summry=(cavill|action|254)
Henry William Dalgliesh Cavill is a British actor
He is known for his portrayal of Charles Brandon in Showtime's The Tudors

I want to extract all the entries, together with their descriptions (the data between two > symbols), that contain pe=1. Each entry starts with the > symbol. The expected output is as follows:

>ab:xy_a0by98-2 \Movie= top gun \actor= Tom \Genere=Action \Length=234 \Credits=30 \pe=1 \summry=(Tom|action|234)
Top Gun is a 1986 American action drama film directed by Tony Scott, and produced by Don Simpson and Jerry Bruckheimer

>ab:xy_c0ma65-1 \Movie= Batman \actor= Bale \Genere=Action \Length=251 \Credits=30 \pe=1 \summry=(Bale|Action|251)
From American Psycho to Batman Begins to Vice, Christian Bale is a bonafide A-list star
But he missed out on plenty of huge roles along the way.

>ab:xy_e0ra81-2 \Movie= Superman \actor= henry cavill \Genere=Action \Length=254 \Credits=28 \pe=1 \summry=(cavill|action|254)
Henry William Dalgliesh Cavill is a British actor
He is known for his portrayal of Charles Brandon in Showtime's The Tudors

I tried grep 'pe=1' input.txt, but it extracts only the first line of each record. I need the subsequent lines of each entry as well, up to the next > symbol.

CodePudding user response:

With your shown samples, please try the following awk code. It will help if you are OK with not having empty lines in the output. Simple explanation: setting RS to the empty string puts awk in paragraph mode, so each block of lines separated by blank lines is one record. The main program then checks whether a record starts with > and also contains \pe=1, and if so prints that whole record.
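To make paragraph mode concrete, here is a minimal sketch on a made-up two-record sample (the record IDs and the temporary file are illustrative assumptions, not the OP's data). It only reports how awk splits the input; it does not filter anything.

```shell
# Hypothetical two-record sample; a blank line separates the records.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>id1 \pe=1
one description line

>id2 \pe=0
first description line
second description line
EOF

# RS= turns on paragraph mode: each blank-line-separated block is one record.
# -F'\n' makes every line of the block a field, so NF is the line count
# and $1 is the header line.
summary=$(awk -v RS= -F'\n' '{ print NR ": " NF " lines, header " $1 }' "$sample")
printf '%s\n' "$summary"
rm -f "$sample"
```

Each record is reported once with its line count, which is why a pattern match on the record prints the header together with its description lines.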

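Also, since the OP confirmed there are Control-M (carriage return) characters in Input_file, first remove them. With them present, the "blank" lines are not truly empty, so paragraph mode cannot split the records. Remove them by running: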

tr -d '\r' < Input_file > tmp && mv tmp Input_file
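As a rough illustration of why the carriage returns matter, the sketch below builds a small made-up file with CRLF line endings (contents and file names are assumptions for the demo) and counts the paragraph-mode records before and after the same tr command.

```shell
# Hypothetical sample with Windows (CRLF) line endings: two records.
crlf_file=$(mktemp)
printf '>id1 \\pe=1\r\ndescription\r\n\r\n>id2 \\pe=0\r\nother\r\n' > "$crlf_file"

# With CRs present the separator line is "\r", not empty, so awk's
# paragraph mode sees the whole file as a single record.
records_before=$(awk -v RS= 'END { print NR }' "$crlf_file")

# Strip the carriage returns, as in the command above.
tr -d '\r' < "$crlf_file" > "$crlf_file.tmp" && mv "$crlf_file.tmp" "$crlf_file"

# Now the blank line is really empty and the records split as expected.
records_after=$(awk -v RS= 'END { print NR }' "$crlf_file")
echo "records before: $records_before, after: $records_after"
rm -f "$crlf_file"
```

If the record count does not change on your real file, carriage returns are probably not the problem.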

Then run the following code to get the actual output:

awk -v RS= '/^>.*\\pe=1/' Input_file
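If the records are not reliably separated by blank lines, a flag-based variant may be an option. This is only a sketch under assumptions: the sample data is made up, and the `( |$)` part assumes the pe value is followed by a space or the end of the line, so that pe=10 would not count as pe=1.

```shell
# Hypothetical sample: note there is no blank line between id1 and id2.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>id1 \Movie= A \pe=1 \summry=(a)
first description
>id2 \Movie= B \pe=0 \summry=(b)
second description

>id3 \Movie= C \pe=1 \summry=(c)
third description
continued line
EOF

# On every header line (starting with >), set the flag p to whether the
# line contains a literal \pe=1 followed by a space or end of line.
# The bare pattern "p" then prints each line while the flag is set.
matches=$(awk '/^>/ { p = /\\pe=1( |$)/ } p' "$sample")
printf '%s\n' "$matches"
rm -f "$sample"
```

Here the header line alone decides whether a record is kept, so a description that happens to mention \pe=1 does not trigger a match. Blank lines inside a kept record are printed as they appear in the input.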