How to pass regular expression matching string from a file in awk?

Time:10-06

I need to split a large file into smaller files. Each line of the large file that contains a matching string should be written to an output file named after that string. For a single string I can do this with awk as shown below.

awk '/apple/{print}' large_file.txt > apple.txt

I want a script that reads the matching strings from another file and writes the results for each one into a file with the same name as the matching string. How can this be done with awk?

Let's say the strings to be matched are put into a file called matching_string.txt, the contents of which look like this:

apple
orange
mango

If the large_file.txt is something like:

apple is a great fruit
we should eat apple
orange is juicy
mango is the king of fruits
litchi is a seasonal fruit

then the resulting files should be:

apple.txt:

apple is a great fruit
we should eat apple

orange.txt:

orange is juicy

mango.txt:

mango is the king of fruits

I am new to the Linux environment and a beginner at scripting. Any other solution using regular expressions, sed, python etc. would also be okay.

CodePudding user response:

Why use awk? Grep does the job too.

pattern=apple
grep -e "$pattern" large.txt > "$pattern.txt"

Write a script or a shell function. For instance, a simple shell function can be defined ad-hoc and then called.

filter() { grep -e "$1" large.txt > "$1.txt"; }
for pattern in apple orange mango; do filter "$pattern"; done

As a shell script (e.g. filter.sh):

#!/bin/sh
grep -e "$1" large.txt > "$1.txt"

The script file must have its executable bit set (chmod +x filter.sh) before it can be run as ./filter.sh.

Assuming your pattern file (e.g. pattern.txt) contains one pattern per line:

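#!/bin/sh
# filter must be defined in this script (or call ./filter.sh instead).
filter() { grep -e "$1" large.txt > "$1.txt"; }
while IFS= read -r pattern; do
  filter "$pattern"
  # or: ./filter.sh "$pattern"
done < pattern.txt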
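
Putting the pieces together, a self-contained sketch might look like the following. The scratch directory and the commands that create the sample files are assumptions added for illustration; the data itself comes from the question. It also uses grep -F so that each line of the pattern file is treated as a literal string rather than a regex; drop -F if real regular expressions are wanted.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real files (assumption).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data taken from the question.
printf '%s\n' apple orange mango > pattern.txt
printf '%s\n' \
  'apple is a great fruit' \
  'we should eat apple' \
  'orange is juicy' \
  'mango is the king of fruits' \
  'litchi is a seasonal fruit' > large.txt

# Write every line of large.txt that contains the string to "<string>.txt".
filter() { grep -F -e "$1" large.txt > "$1.txt"; }

while IFS= read -r pattern; do
  # Skip empty lines: an empty pattern would match every line.
  [ -n "$pattern" ] || continue
  filter "$pattern"
done < pattern.txt
```

Lines that match none of the patterns (the litchi line here) are simply not written anywhere.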

CodePudding user response:

To do this in awk:

for word in $(cat matching_string.txt)
do
  awk "/${word}/ { print }" large_file.txt > "${word}.txt"
done

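An awk program is a series of pattern { action } pairs: here the pattern is the regex /word/ and the action prints each matching line. Note that POSIX awk has no regex capture groups, and extensions that provide them (such as the three-argument match() in GNU awk) vary from one platform to another.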
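
The loop above rereads large_file.txt once per word. As an alternative sketch (not part of the original answer), awk can load the words from matching_string.txt first and then make a single pass over the large file. The array name pats and the scratch directory are illustrative assumptions; the sample data comes from the question.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real files (assumption).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data taken from the question.
printf '%s\n' apple orange mango > matching_string.txt
printf '%s\n' \
  'apple is a great fruit' \
  'we should eat apple' \
  'orange is juicy' \
  'mango is the king of fruits' \
  'litchi is a seasonal fruit' > large_file.txt

awk '
  # First file: remember each non-empty line as a regex, then skip to the next line.
  NR == FNR { if ($0 != "") pats[$0]; next }
  # Second file: write the line to "<regex>.txt" for every stored regex it matches.
  { for (p in pats) if ($0 ~ p) print > (p ".txt") }
' matching_string.txt large_file.txt
```

NR == FNR is true only while awk is reading the first file named on the command line, which is what separates the two phases. With a very large number of patterns, some awk implementations may hit a limit on simultaneously open output files.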

Even for a simple regex I prefer perl, because in cross-platform environments (particularly macOS and Git Bash on Windows) perl has a more consistent regex implementation. In this case, the perl solution would be:

for word in $(cat matching_string.txt)
do
  perl -ne "if (/${word}/) { print }" < large_file.txt > "${word}.txt"
done

I also wanted to demonstrate capture groups. In this case it is a bit over-engineered to represent your line as 3 capture groups (prefix, word, postfix), but I do this because it serves as a template for more complex capture-group processing. Because the perl code sits inside a double-quoted shell string, $1, $2 and $3 are escaped so that the shell does not expand them before perl sees them:

for word in $(cat matching_string.txt)
do
   perl -ne "if (/(.*)(${word})(.*)/) { print \"\$1\$2\$3\n\" }" < large_file.txt > "${word}.txt"
done