How to extract the word after pattern, but the word is in next line?

Time:02-28

I want to extract every word that comes after a pattern. However, I can only extract the word when it is on the same line as the pattern; if the word comes right after a line break, I can't get it. For example,

Gary is a college student.
Steve and John are college
teachers.

I want to extract "student" and "teachers", but I only got "student" back. My solution is

grep -oP '(?<=college )[\w ]*' | sort | uniq

CodePudding user response:

Tools like grep are fundamentally line-oriented. GNU grep has a -z option that uses NUL bytes as delimiters instead of newlines, though, which lets you treat the whole input file as a single big 'line':

$ grep -Pzo 'college\s+\K\w+' input.txt | tr '\0' '\n'
student
teachers
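
As a rough sketch, the same command can be wrapped in a small function and followed by sort -u, so that repeated words are listed only once. The function name words_after_college and the temporary sample file are made up for illustration; the approach assumes GNU grep built with -P (PCRE) support.

```shell
# Hypothetical helper: print each distinct word that follows "college",
# even when a line break separates the two. Assumes GNU grep with -P.
words_after_college() {
    # -z: treat the whole file as one NUL-terminated record
    # \s+ matches spaces and newlines; \K drops "college" from the match
    grep -Pzo 'college\s+\K\w+' "$1" | tr '\0' '\n' | sort -u
}

# Sample input taken from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'Gary is a college student.' \
              'Steve and John are college' \
              'teachers.' > "$sample"

words_after_college "$sample"
rm -f "$sample"
```

Because \w+ stops at the first non-word character, the trailing periods are not part of the output.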

CodePudding user response:

grep (like most Unix text-processing tools) examines one line at a time and can't match across line boundaries. A simple Awk script might work instead:

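awk '{ for(i=1; i<NF; ++i)
    if ($i=="college") print $(i+1) }
n { print $1; n=0 }      # previous line ended with "college"
$NF=="college" { n=1 }   # must come after the n rule, or it fires on this same line
' file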

You can easily refactor this to count the number of hits in Awk, too, and avoid the pipe to sort | uniq (or, better, sort -u), but I left that as an exercise. Learning enough Awk to write simple scripts like this yourself is time well spent.
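
One possible way to do the deduplication inside Awk, offered as a sketch rather than the answerer's own solution: keep a seen[] array and print a word only the first time it appears. The function names unique_after_college and emit are hypothetical, and the extra "Ann" line in the sample is not from the question; it is there only to show a duplicate being suppressed. Punctuation is stripped so that "student." becomes "student".

```shell
# Hypothetical helper: list each distinct word that follows "college",
# including a word that starts the next line. Uses plain awk features only.
unique_after_college() {
    awk '
    function emit(w) {
        gsub(/[^A-Za-z0-9_]/, "", w)        # drop punctuation such as a trailing "."
        if (w != "" && !seen[w]++) print w  # print only the first occurrence
    }
    n && NF { emit($1); n = 0 }             # previous non-empty line ended with "college"
    { for (i = 1; i < NF; ++i) if ($i == "college") emit($(i + 1)) }
    $NF == "college" { n = 1 }              # remember the match for the next line
    ' "$1"
}

sample=$(mktemp)
printf '%s\n' 'Gary is a college student.' \
              'Steve and John are college' \
              'teachers.' \
              'Ann is a college student.' > "$sample"

unique_after_college "$sample"
rm -f "$sample"
```

Words come out in order of first appearance rather than sorted; pipe the result through sort if alphabetical order matters.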
