How to get from a file only the words with repeated characters

I need to extract from the file the words that contain certain letters a certain number of times.

I apologize if this question has been resolved in the past; I just did not find anything that fits what I am looking for.

File:

wab 12aaabbb  abababx ab ttttt baaabb zabcabc
baab baaabb cbaab  ab  ccabab zzz

For example

1. If I choose the letter a and the number 1, the output should be:
   wab
   ab
   ab
   //only the words that contain a, where the char appears in the word exactly 1 time

2. If I choose the letters a,b and the number 3, the output should be:

   12aaabbb
   abababx
   baaabb
   //only the words that contain a,b, where both chars appear in the word 3 times

3. If I choose the letters a,b,c and the number 2, the output should be:

   ccabab
   zabcabc
   //only the words that contain a,b,c, where each char appears in the word 2 times

Is it possible to search for two letters in the same script? I managed to search for a single letter, but I only get the words where the letter appears consecutively, and I do not want only those words. This is what I did:

 egrep '([a])\1{N-1}' file

Another problem: I cannot get only the specific words. I get the whole file, with the letter I am looking for ("a") in red. I tried using -w, but then nothing is displayed.

CodePudding user response：

There are various ways to split input so that grep sees a single word per line. tr is most common. For example:

tr -s '[:space:]' '\n' < file | ...

We can build a function to find a specific number of a particular letter:

NofL(){
    num=$1
    letter=$2
    regex="^[^$letter]*($letter[^$letter]*){$num}$"
    grep -E "$regex"
}

Then:

# letter=a number=1
tr -s '[:space:]' '\n' < file | NofL 1 a

# letters=a,b number=3
tr -s '[:space:]' '\n' < file | NofL 3 a | NofL 3 b

# letters=a,b,c number=2
tr -s '[:space:]' '\n' < file | NofL 2 a | NofL 2 b | NofL 2 c

CodePudding user response：

You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}, provided it is anchored to the whole string (as grep -x does). To do this for multiple characters, you can chain the matches. The traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this:

<infile tr -s ' \t\n' '\n' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
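
As a small check of why the whole-line anchoring from -x matters, the sketch below wraps the pattern in a helper and runs it on a few words from the question. The name count_exact is hypothetical, and the helper assumes X is a single plain letter or digit.

```shell
# Hypothetical helper: keep only input lines with exactly N copies of character X.
# Assumes X is a single character that is not special inside a bracket expression.
count_exact() {
    n=$1
    x=$2
    # -x anchors the pattern to the whole line; without it the pattern could
    # match a substring and would accept words with more than N occurrences.
    grep -Ex "[^$x]*($x[^$x]*){$n}"
}

printf '%s\n' wab 12aaabbb baab zabcabc | count_exact 2 a
```

With this input, only baab and zabcabc are printed, since wab has one a and 12aaabbb has three.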

If you get colorized output from egrep, grep, and maybe some other utilities, it is usually because, in a GNU-ish environment, aliases turn them into e.g. egrep --color=auto (or, rarely, --color=always). These aliases are often set by a profile that was provided automatically and that you never looked at or modified. Using \grep, command grep, or the full pathname such as /usr/bin/grep bypasses the alias, or you can simply remove it with unalias. Another possibility is that environment variable(s) are set, in which case you need to unset or override them, explicitly pass --color=never, or (somewhat hackily) pipe the output through ... | cat, which makes [e]grep's stdout a pipe rather than a tty and thus turns off =auto.
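
One way to combine two of these options is a small wrapper, sketched here under the assumption of GNU grep (--color is a GNU extension). The name plain_grep is made up for illustration.

```shell
# Hypothetical wrapper: run grep with colour forced off, ignoring any alias.
# 'command' skips shell functions, and since grep is no longer the first word
# of the command it is not subject to alias expansion either.
plain_grep() {
    command grep --color=never "$@"
}

printf 'wab\nzzz\n' | plain_grep a
```

This prints wab without any escape sequences, whether or not stdout is a terminal.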

However, GNU awk (not necessarily others) can also do this more directly:

<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
    f["a"]==3&&f["b"]==3'

or to parameterize the criteria:

<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,"");n=3}
    {delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=n)s=0} s'

perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.
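
As one possible take on that exercise, here is a sketch that counts characters directly instead of matching a regex, using only POSIX shell features plus tr. The function names has_exact_counts and filter_words are hypothetical, and the characters are assumed to be plain letters or digits. It starts a tr process per word and per character, so it is meant as an illustration rather than a fast solution.

```shell
# Hypothetical helper: succeed if each character in $3 occurs exactly $2 times in word $1.
# Assumes the characters are plain letters or digits (nothing special to tr).
has_exact_counts() {
    word=$1 n=$2 rest=$3
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}   # first character of $rest
        rest=${rest#?}          # drop that character
        # tr -cd deletes every character except $c; the length of what is left is the count.
        only=$(printf '%s' "$word" | tr -cd "$c")
        [ "${#only}" -eq "$n" ] || return 1
    done
}

# Print, one per line, the whitespace-separated words on stdin that qualify.
# $1 is the required count, $2 the string of characters to check.
filter_words() {
    tr -s '[:space:]' '\n' | while IFS= read -r w; do
        if has_exact_counts "$w" "$1" "$2"; then printf '%s\n' "$w"; fi
    done
}

printf 'wab 12aaabbb abababx ab\nbaab baaabb zzz\n' | filter_words 3 ab
```

For this sample input the words printed are 12aaabbb, abababx and baaabb.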

CodePudding user response：

Regexes are not really suited for that job, as there are more efficient ways, but it is possible using repeated matching. We first select all words; from those we select the words with n occurrences of a; from those we select the words with n occurrences of b; and so on.

Example for n=3 and a, b:

grep -Eo '[[:alnum:]]+' file.txt |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'

To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:

exactly_n_times_char() {
    (( $# >= 2 )) || { cat; return; }
    local n="$1" char="$2" regex
    regex="[^$char]*($char[^$char]*){$n}"
    shift 2
    grep -Ex "$regex" | exactly_n_times_char "$n" "$@"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b

With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:

exactly_n_times_char() {
    local n="$1" regex=""
    shift
    for char; do  # could be done without a loop using sed on $*
        # the trailing \b makes the count exact rather than "at least n"
        regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n}\\b)"
    done
    regex+='\w+'
    grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt

If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
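
If the original order matters, a common alternative to sort -u is a short awk filter that prints each line only the first time it is seen. The sketch below wraps it in a hypothetical helper named dedupe_keep_order; the array name seen is likewise arbitrary.

```shell
# Print each input line only the first time it appears, keeping the original order.
# seen[$0]++ is 0 (false) on first sight, so !seen[$0]++ is true exactly once per line.
dedupe_keep_order() {
    awk '!seen[$0]++'
}

printf '%s\n' 12aaabbb abababx baaabb baaabb | dedupe_keep_order
```

Here the second baaabb is dropped and the remaining three words keep their input order.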

CodePudding user response：

A method using sed and bash would be:

#!/bin/bash

file=$1
n=$2
chars=$3

for ((i = 0; i < ${#chars}; ++i)); do
    c=${chars:i:1}
    args+=(-e)
    args+=("/^\\([^$c]*$c\\)\\{$n\\}[^$c]*\$/!d")
done

sed "${args[@]}" <(tr -s '[:blank:]' '\n' < "$file")

Notice that filename, count, and characters are parameterized. Use it as

./script filename 2 abc

which should print out

zabcabc
ccabab

given the file content in the question.