Count a word in a line only once while for looping

Time:09-16

I really need help for a specific approach.

I have a list like this:

promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic  
intron (NM_001135610, intron 6 of 7)
Intergenic  
Intergenic  
Intergenic  
intron (NM_201628, intron 1 of 14)

Here, for example, intron is counted more than once on some lines. I want each word to be counted only once per line.

For the above list, the output should be:

promoter count : 1
intron count : 3 #not 6
Intergenic count : 5

I generally manipulate these kinds of txt files on the command line. I need to run this on a large set of files, so I would really appreciate the help. Thank you so much!

CodePudding user response:

With awk:

$ awk -F'[ \t]+|-' '
    { counts[$1]++ }
    END { for (c in counts) printf "%s count : %d\n", c, counts[c] }' input.txt
Intergenic count : 4
intron count : 3
promoter count : 1

CodePudding user response:

When the first word is followed by - or a space, you can count the first words with

sed 's/[ -].*//' file | sort | uniq -c
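If the output should match the "word count : N" layout from the question, one option is to reshape what uniq -c prints with a small awk step. A minimal sketch, assuming the data lives in a file named input.txt; the sample file written here and the function name count_first_words are made up for illustration:

```shell
# Hypothetical sample file reproducing the list from the question
cat > input.txt <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
intron (NM_201628, intron 1 of 14)
EOF

# count_first_words is a hypothetical helper name.
# sed keeps only the first word of each line, sort groups identical words,
# uniq -c prints "N word", and awk swaps that into "word count : N".
count_first_words() {
    sed 's/[ -].*//' "$1" | LC_ALL=C sort | uniq -c |
        awk '{ printf "%s count : %d\n", $2, $1 }'
}

count_first_words input.txt
```

LC_ALL=C is only there to make the sort order predictable across machines; the counts themselves do not depend on it.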

CodePudding user response:

Assumptions:

  • fields are delimited by non-alphanumerics (i.e., [^[:alnum:]])
  • using case insensitive comparisons
  • in OP's expected output the Intergenic count should be 4
  • ignore blank lines

Sample data:

$ cat list.dat
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)              # case change for first character
                                                # blank line
Intergenic7 e3u                                 # first delimiter == " "
intron9(NM_201628, intron 1 of 14)              # first delimiter == "("

One awk idea:

awk '
    { split($0,arr,"[^[:alnum:]]")              # split line using all non-alphanumerics as delimiters
      if ( arr[1] != "" )                       # if not a blank line ...
         count[tolower(arr[1])]++               # lowercase/count the first field
    }
END { for (i in count)                          # loop through list of words
      printf "%s count : %s\n", i, count[i]
    }
' list.dat

# or as a one-liner (sans comments)

awk '{split($0,arr,"[^[:alnum:]]"); if (arr[1] != "") count[tolower(arr[1])]++ } END {for (i in count) printf "%s count : %s\n", i, count[i]}' list.dat

This generates:

intergenic7 count : 1
intron9 count : 1
intron count : 3
intergenic count : 4
promoter count : 1

NOTES:

  • we're using tolower() to facilitate case-insensitive comparison so all output is lowercase
  • OP hasn't stipulated how the output is to be used so this solution could be further modified to take into consideration display order, saving to a file, saving to an associative array, etc
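On the display-order point: awk's for (i in count) loop returns keys in no guaranteed order, so if a stable order matters, one option is to pipe the result through sort. A rough sketch, assuming the list.dat sample from this answer (inline comments left out); count_sorted is a made-up helper name, and sorting by count descending with ties broken alphabetically is just one possible choice:

```shell
# Sample data from this answer, without the inline comments
cat > list.dat <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)

Intergenic7 e3u
intron9(NM_201628, intron 1 of 14)
EOF

# count_sorted is a hypothetical helper name.
# Same counting logic as above; sort then orders by the number after ":"
# (descending) and breaks ties on the word itself (ascending).
count_sorted() {
    awk '
        { split($0, arr, "[^[:alnum:]]")
          if (arr[1] != "")
             count[tolower(arr[1])]++
        }
    END { for (i in count)
          printf "%s count : %d\n", i, count[i]
        }
    ' "$1" | LC_ALL=C sort -t: -k2,2nr -k1,1
}

count_sorted list.dat
```

Appending > counts.txt to the last command would save the result to a file instead of printing it.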