I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contains a,b,c and the chars appear in the word 3 times
Is it possible to find 2 letters in the same script? I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red. I tried using -w but it does not display anything.
CodePudding user response:
There are various ways to split input so that grep sees a single word per line. tr
is most common. For example:
tr -s '[:space:]' '\n' file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' file | NofL 2 a | NofL 2 b | NofL 2 c
CodePudding user response:
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}
. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' ' ' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep
and grep
and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto
or possibly/rarely =always
; using \grep
or command grep
or the pathname such as /usr/bin/grep
disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never
, or (somewhat hackily) pipe the output through ... | cat
which has the effect of making [e]grep
's stdout a pipe not a tty and thus turning off =auto
.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n] ' -F '' '{delete f;for(i=1;i<=NF;i )f[$i] }
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n] ' -F '' 'BEGIN{split("ab",w,//);n=3}
{delete f;for(i=1;i<=NF;i )f[$i] ;s=1;for(t in w)if(f[w[t]]!=occur)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.
CodePudding user response:
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n a
s, and from those we select words with n b
s and so on.
Example for n=3 and a
, b
:
grep -Eo '[[:alnum:]] ' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b
, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$@"
}
grep -Eo '[[:alnum:]] ' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep
or pcregrep
) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex ="(?=[^$char\\W]*($char[^$char\\W]*){$n})"
done
regex ='\w '
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb
in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u
but that will change the order.
CodePudding user response:
A method using sed
and bash
would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; i)); do
c=${chars:i:1}
args =(-e)
args =("/^\\([^$c]*$c\\)\\{$n\\}[^$c]*\$/!d")
done
sed "${args[@]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.