I need to split a large file into smaller files. Every line of the large file that contains a given matching string should go into an output file named after that string. For a single string I can do this with awk, as shown below.
awk '/apple/{print}' large_file.txt > apple.txt
I want a script that reads the matching strings (regular expressions) from another file and writes the matching lines to a file named after each string. How can I do this with awk?
Let's say the strings to be matched are stored in a file called matching_string.txt, whose contents look like this:
apple
orange
mango
If the large_file.txt is something like:
apple is a great fruit
we should eat apple
orange is juicy
mango is the king of fruits
litchi is a seasonal fruit
then the resulting files should be
apple.txt:
apple is a great fruit
we should eat apple
orange.txt:
orange is juicy
mango.txt:
mango is the king of fruits
I am new to the Linux environment and a beginner at scripting. Any other solution using regular expressions, sed, Python, etc. would also be fine.
CodePudding user response:
Why use awk? Grep does the job too.
pattern=apple
grep -e "$pattern" large.txt > "$pattern.txt"
Write a script or a shell function. For instance, a simple shell function can be defined ad-hoc and then called.
filter() { grep -e "$1" large.txt > "$1.txt"; }
for pattern in apple orange mango; do filter "$pattern"; done
As a shell script (e.g. filter.sh):
#!/bin/sh
grep -e "$1" large.txt > "$1.txt"
The script file must have the executable bit set (chmod +x filter.sh), otherwise it cannot be run as ./filter.sh.
Assuming your pattern file (e.g. pattern.txt) contains one pattern per line:
#!/bin/sh
while IFS= read -r pattern; do
filter "$pattern"  # the filter function must be defined earlier in this script
# or: ./filter.sh "$pattern"
done < pattern.txt
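The function and the loop can be combined into one script. The following is only a sketch: the scratch directory `workdir` and the sample data are assumptions added so it runs on its own, and the file names follow the question rather than this answer. It also skips blank lines in the pattern file, because an empty pattern matches every line.

```shell
#!/bin/sh
# Sketch: one output file per pattern, reusing the filter idea from above.
# workdir is a scratch directory holding hypothetical sample data.
workdir=$(mktemp -d)

printf '%s\n' 'apple is a great fruit' 'we should eat apple' \
  'orange is juicy' 'mango is the king of fruits' \
  'litchi is a seasonal fruit' > "$workdir/large_file.txt"
printf '%s\n' apple orange '' mango > "$workdir/matching_string.txt"

# filter PATTERN: write every line matching PATTERN to PATTERN.txt
filter() {
  grep -e "$1" "$workdir/large_file.txt" > "$workdir/$1.txt"
}

while IFS= read -r pattern; do
  # Skip blank lines: an empty pattern would match every line.
  [ -n "$pattern" ] || continue
  filter "$pattern"
done < "$workdir/matching_string.txt"
```

Because the output is created by a shell redirection, a pattern that matches nothing still produces an empty file.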
CodePudding user response:
To do this in awk:
while IFS= read -r word
do
awk "/${word}/ { print }" large_file.txt > "${word}.txt"
done < matching_string.txt
The awk program is a regex pattern followed by an action in braces. Note that when you get into regex capture groups, you may find that the implementation of awk varies from one platform to another.
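The loop above runs awk once per pattern, so large_file.txt is read once for every line of matching_string.txt. As an alternative, here is a minimal single-pass sketch. It assumes matching_string.txt is not empty (the `NR == FNR` idiom relies on that), and the scratch directory `workdir` and the `outdir` variable are names introduced only for this example.

```shell
#!/bin/sh
# Sketch: split large_file.txt in a single awk pass.
# workdir is a scratch directory holding hypothetical sample data.
workdir=$(mktemp -d)
printf '%s\n' apple orange mango > "$workdir/matching_string.txt"
printf '%s\n' 'apple is a great fruit' 'we should eat apple' \
  'orange is juicy' 'mango is the king of fruits' \
  'litchi is a seasonal fruit' > "$workdir/large_file.txt"

awk -v outdir="$workdir" '
  # First file (the pattern list): store each non-empty line as a regex.
  NR == FNR { if ($0 != "") patterns[$0] = 1; next }
  # Second file: write the line to <pattern>.txt for every regex it matches.
  {
    for (p in patterns)
      if ($0 ~ p) print > (outdir "/" p ".txt")
  }
' "$workdir/matching_string.txt" "$workdir/large_file.txt"
```

Within a single awk run, `print > file` truncates the file on the first write and appends afterwards, so each output file collects all of its matches. Unlike the loop, a pattern that matches nothing produces no file at all, and a very long pattern list may run into the limit on simultaneously open files in some awk implementations.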
If it is a simplistic regex, I prefer perl because in cross-platform environments (particularly macOS and Git Bash on Windows), perl has a more consistent implementation for regex handling. In this case, the perl solution would be:
while IFS= read -r word
do
perl -ne "if (/${word}/) { print }" < large_file.txt > "${word}.txt"
done < matching_string.txt
I also wanted to demonstrate capture groups. Representing the line as three capture groups (prefix, word, postfix) is over-engineered here, but it serves as a template for more complex capture-group processing:
while IFS= read -r word
do
# \$1 \$2 \$3 are escaped so the shell passes them through to perl unexpanded
perl -ne "if (/(.*)(${word})(.*)/) { print \"\$1\$2\$3\n\" }" < large_file.txt > "${word}.txt"
done < matching_string.txt
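For comparison, the same prefix/word/postfix split can be sketched in awk without capture groups. POSIX awk's `match()` sets `RSTART` and `RLENGTH`, which is enough to slice the line with `substr()`; the array form of `match()` that returns groups is a gawk extension. The sample input and the square brackets around the match are only for illustration. Note that `match()` finds the leftmost occurrence, whereas the greedy `(.*)` prefix in the perl version ends at the last one.

```shell
#!/bin/sh
# Sketch: split each matching line into prefix, match, and postfix with POSIX awk.
word=apple

result=$(printf '%s\n' 'we should eat apple daily' 'orange is juicy' |
  awk -v word="$word" '
    # match() returns the start position (0 if none) and sets RSTART/RLENGTH.
    match($0, word) {
      prefix  = substr($0, 1, RSTART - 1)
      matched = substr($0, RSTART, RLENGTH)
      postfix = substr($0, RSTART + RLENGTH)
      print prefix "[" matched "]" postfix
    }
  ')

printf '%s\n' "$result"
```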