Count a word in a line only once while for looping

I really need help with a specific approach.

I have a list like this:

promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic  
intron (NM_001135610, intron 6 of 7)
Intergenic  
Intergenic  
Intergenic  
intron (NM_201628, intron 1 of 14)

For example, intron is being counted more than once on some lines. I want to count each word only once per line.

For the above list, the output should be:

promoter count : 1
intron count : 3 #not 6
Intergenic count : 5

Generally, I manipulate these kinds of txt files on the command line. I need to run this on a big set, so I really need help! Thank you so much.

CodePudding user response：

With awk:

$ awk -F'[ \t]+|-' '
    { counts[$1]++ }
    END { for (c in counts) printf "%s count : %d\n", c, counts[c] }' input.txt
Intergenic count : 4
intron count : 3
promoter count : 1

CodePudding user response：

When the first word is followed by - or a space, you can count the first words with

sed 's/[ -].*//' file | sort | uniq -c
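
Note that uniq -c prints the count before the word (for example "4 Intergenic"), not the "word count : N" layout from the question. A minimal sketch of one way to reshape it, assuming blank lines should not be counted; the function name first_word_counts is made up for illustration and is not part of the answer above:

```shell
# Hypothetical helper: print "word count : N" for the first word of each line in file "$1".
# Steps: cut everything from the first space or hyphen onward, drop empty results
# (assumption: blank lines are not counted), count duplicates, then swap the
# "count word" order that uniq -c produces.
first_word_counts() {
    sed 's/[ -].*//' "$1" |
        grep -v '^$' |
        sort |
        uniq -c |
        awk '{ printf "%s count : %d\n", $2, $1 }'
}
```

It would be called as first_word_counts file, with file being the same input as in the sed command.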

CodePudding user response：

Assumptions:

fields are delimited by non-alphanumerics (ie, [^[:alnum:]])
using case insensitive comparisons
in OP's expected output the Intergenic count should be 4
ignore blank lines

Sample data:

$ cat list.dat
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)              # case change for first character
                                                # blank line
Intergenic7 e3u                                 # first delimiter == " "
intron9(NM_201628, intron 1 of 14)              # first delimiter == "("

One awk idea:

awk '
    { split($0,arr,"[^[:alnum:]]")              # split line using all non-alphanumerics as delimiters
      if ( arr[1] != "" )                       # if not a blank line ...
         count[tolower(arr[1])]++               # lowercase/count the first field
    }
END { for (i in count)                          # loop through list of words
      printf "%s count : %s\n", i, count[i]
    }
' list.dat

# or as a one-liner (sans comments)

awk '{split($0,arr,"[^[:alnum:]]"); if (arr[1] != "") count[tolower(arr[1])]++ } END {for (i in count) printf "%s count : %s\n", i, count[i]}' list.dat

This generates:

intergenic7 count : 1
intron9 count : 1
intron count : 3
intergenic count : 4
promoter count : 1

NOTES:

we're using tolower() to facilitate case-insensitive comparison so all output is lowercase
OP hasn't stipulated how the output is to be used so this solution could be further modified to take into consideration display order, saving to a file, saving to an associative array, etc
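
As one possible take on the display-order point, here is a sketch that prints the most frequent word first by piping the awk output through sort (awk's for-in loop has no guaranteed order). The function name count_first_words_sorted is hypothetical, and the delimiter is spelled out as [^A-Za-z0-9], an ASCII-only stand-in for [^[:alnum:]] chosen on the assumption that some older awk builds lack POSIX class support:

```shell
# Hypothetical helper: count lowercased first words in file "$1", most frequent first.
count_first_words_sorted() {
    awk '
        { split($0, arr, "[^A-Za-z0-9]")             # split on non-alphanumerics (ASCII only)
          if (arr[1] != "")                          # skip lines with no leading word
              count[tolower(arr[1])]++               # lowercase and count the first field
        }
        END { for (w in count) printf "%s count : %d\n", w, count[w] }
    ' "$1" |
        sort -t ':' -k2,2nr -k1,1                    # count descending, then word ascending
}
```

Saving to a file is then a plain redirect, e.g. count_first_words_sorted list.dat > counts.txt (counts.txt is just a placeholder name).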