I want to extract every word that comes after a pattern. However, I can only extract the word when it is on the same line as the pattern; if the word comes right after a line break, I can't get it. For example:
Gary is a college student.
Steve and John are college
teachers.
I want to extract "student" and "teachers", but I only get "student" back. My solution is:
grep -oP '(?<=college )[\w ]*' | sort | uniq
CodePudding user response:
Tools like grep are fundamentally line oriented. GNU grep has a -z option, though, that uses 0 bytes as delimiters instead of newlines, which lets you treat the input file as a single big 'line':
$ grep -Pzo 'college\s+\K\w+' input.txt | tr '\0' '\n'
student
teachers
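To fold this back into the deduplicating pipeline from the question, something like the following sketch might do. It assumes GNU grep built with -P support; the temporary file and the extra "Anna" line are made up here purely to show a duplicate being removed.

```shell
# Hypothetical sample input: the question's three lines plus one
# invented line that produces a duplicate "student".
input=$(mktemp)
printf '%s\n' \
  'Gary is a college student.' \
  'Steve and John are college' \
  'teachers.' \
  'Anna is a college student.' > "$input"

# -z: treat the whole file as one NUL-terminated record, so \s+ can match a newline
# -o: print only the matched part; \K discards "college" and the whitespace from the match
# tr turns the NUL terminators back into newlines; sort -u removes duplicates
unique=$(grep -Pzo 'college\s+\K\w+' "$input" | tr '\0' '\n' | sort -u)

printf '%s\n' "$unique"
rm -f "$input"
```

Because \w+ stops at the first non-word character, the trailing periods are not part of the output.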
CodePudding user response:
grep (or really, most Unix text processing tools) examines a single line at a time and can't match across line boundaries. A simple Awk script might work instead:
awk 'n { print $1; n=0 }   # previous line ended with "college"
{ for(i=1; i<NF; i++)
    if ($i=="college") print $(i+1) }
$NF=="college" { n=1 }' file
You can easily refactor this to count the number of hits in Awk, too, and avoid the pipe to sort | uniq (or, better, sort -u), but I left that as an exercise. Learning enough Awk to write simple scripts like this yourself is time well spent.
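One possible take on that exercise, offered as a sketch rather than the answerer's own code: remember which words have already been printed so that no sort | uniq is needed. The emit helper and the seen array are names invented for this example, and the punctuation stripping is an added assumption, since whitespace-split fields would otherwise come out as "student." and "teachers.".

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' \
  'Gary is a college student.' \
  'Steve and John are college' \
  'teachers.' > "$input"

result=$(awk '
  # Hypothetical helper: strip non-word characters, then print the word
  # only the first time it is seen.
  function emit(w) {
    gsub(/[^A-Za-z0-9_]/, "", w)
    if (w != "" && !seen[w]++) print w
  }
  n { emit($1); n = 0 }            # previous line ended with "college"
  { for (i = 1; i < NF; i++)       # "college" followed by a word on the same line
      if ($i == "college") emit($(i + 1)) }
  $NF == "college" { n = 1 }       # the wanted word starts the next line
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Unlike sort -u, this keeps the words in the order they first appear. Changing the print in emit to a counter and printing the totals in an END block would give per-word hit counts instead.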