I need help with a specific counting problem.
I have a list like this:
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
intron (NM_201628, intron 1 of 14)
For example, intron can occur more than once on a single line. I want each word to be counted only once per line.
For the above list, the output should be:
promoter count : 1
intron count : 3 #not 6
Intergenic count : 5
Generally, I manipulate these kinds of txt files on the command line. I need to run this on a large set of files, so I would really appreciate any help. Thank you so much!
CodePudding user response:
With awk:
$ awk -F'[ \t]+|-' '
{ counts[$1]++ }
END { for (c in counts) printf "%s count : %d\n", c, counts[c] }' input.txt
Intergenic count : 4
intron count : 3
promoter count : 1
CodePudding user response:
When the first word is followed by - or a space, you can count the first words with:
sed 's/[ -].*//' file | sort | uniq -c
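The `uniq -c` output has the form `N word`, not the `word count : N` layout asked for in the question. A minimal sketch of one way to reshape it, assuming the question's list as input (the `sample_data` function is a hypothetical stand-in for the real file):

```shell
# Hypothetical stand-in for the input file, copied from the question's list.
sample_data() {
cat <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
intron (NM_201628, intron 1 of 14)
EOF
}

# Keep only the first word of each line, count duplicates,
# then turn "N word" into "word count : N".
result=$(sample_data | sed 's/[ -].*//' | LC_ALL=C sort | uniq -c |
         awk '{ printf "%s count : %d\n", $2, $1 }')
printf '%s\n' "$result"
```

`LC_ALL=C` is only there to make the sort order predictable across systems; it can be dropped if the order of the output lines does not matter.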
CodePudding user response:
Assumptions:
- fields are delimited by non-alphanumerics (i.e., [^[:alnum:]])
- using case-insensitive comparisons
- in OP's expected output the Intergenic count should be 4
- ignore blank lines
Sample data:
$ cat list.dat
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14) # case change for first character
# blank line
Intergenic7 e3u # first delimiter == " "
intron9(NM_201628, intron 1 of 14) # first delimiter == "("
One awk idea:
awk '
    { split($0,arr,"[^[:alnum:]]")      # split line using all non-alphanumerics as delimiters
      if ( arr[1] != "" )               # if not a blank line ...
         count[tolower(arr[1])]++       # lowercase/count the first field
    }
END { for (i in count)                  # loop through list of words
          printf "%s count : %s\n", i, count[i]
    }
' list.dat
# or as a one-liner (sans comments)
awk '{split($0,arr,"[^[:alnum:]]"); if (arr[1] != "") count[tolower(arr[1])]++ } END {for (i in count) printf "%s count : %s\n", i, count[i]}' list.dat
This generates:
intergenic7 count : 1
intron9 count : 1
intron count : 3
intergenic count : 4
promoter count : 1
NOTES:
- we're using tolower() to facilitate case-insensitive comparison, so all output is lowercase
- OP hasn't stipulated how the output is to be used, so this solution could be further modified to take into consideration display order, saving to a file, saving to an associative array, etc.
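As an illustration of the display-order point, here is a minimal sketch that pipes the same kind of awk count through sort so the most frequent word comes first. It assumes ASCII input, so the regex `[^A-Za-z0-9]` is used as a stand-in for `[^[:alnum:]]`, and `sample_data` is a hypothetical replacement for list.dat (without the trailing comments):

```shell
# Hypothetical stand-in for list.dat so the example is self-contained.
sample_data() {
cat <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)

Intergenic7 e3u
intron9(NM_201628, intron 1 of 14)
EOF
}

# Count the lowercased first word of each non-blank line, then sort by
# count (descending) and by word (ascending) to break ties.
sorted=$(sample_data | awk '
    { split($0, arr, /[^A-Za-z0-9]/)    # ASCII-only assumption, see above
      if (arr[1] != "")
          count[tolower(arr[1])]++
    }
END { for (i in count) printf "%s count : %d\n", i, count[i] }
' | LC_ALL=C sort -t: -k2,2nr -k1,1)
printf '%s\n' "$sorted"
```

Sorting outside awk keeps the awk program unchanged and works with any awk implementation, since `for (i in count)` makes no promise about iteration order.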