Finding the number of unique values that contain another set of unique values

For example, my text file looks something like this:

year, user, tweet
2009, Katie, I love playing football
2010, James, I play football
2013, Bob, I play basketball
2013, James, I play Baseball

The delimiter is ',' and I want to count how many unique users have mentioned the exact word 'play' in their tweet, using Bash in a one-liner.

The output of this should be 2: James mentions 'play' twice and Bob once (not Katie, since her word is 'playing'), so 2 people.

I have tried this:

$ cut -d ',' -f 2,3 Dataset.txt | grep "\<play\>" | sort | uniq -c

CodePudding user response:

The problem with your pipeline is that while uniq -c does provide a count of unique occurrences, "James, I play Baseball" and "James, I play football" are considered distinct lines, so James is counted twice. You can limit the comparison to the first N characters with the -w N option to uniq (in your case -w3), but you are much better off (and much, much more efficient) using a single call to awk.
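If you do want to stay with a pipeline, one possible repair (a sketch assuming GNU or BSD grep, whose -w flag matches 'play' only as a whole word; the leading space cut leaves on each name is harmless because it is consistent) is to re-extract the name after the match and count distinct names:

$ cut -d ',' -f 2,3 Dataset.txt | grep -w 'play' | cut -d ',' -f 1 | sort -u | wc -l
2

That spawns five processes where awk needs only one, which is why the single awk call below is preferable.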

Here you are concerned with the 2nd field (the name) and with whether play occurs in the record as a standalone word. You can use /play[[:blank:]]/ (or, more strictly, /[[:blank:]]play[[:blank:]]/) as the test for "play" alone. Then each time a record containing "play" alone is encountered, you increment a count in the array a[] indexed by the name, i.e. a[$2]++. In the END rule you output each name and its number of occurrences.

That makes the task quite simple, e.g.

awk -F, '/[[:blank:]]play[[:blank:]]/{a[$2]++} END {for (i in a) print i, a[i]}' Dataset.txt

Output

 James 2
 Bob 1
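If you want just the number of distinct users (the 2 asked for in the question) rather than the per-user breakdown, count the array entries in the END rule instead. A sketch using a portable counting loop (gawk also accepts length(a) on an array, but that is a GNU extension):

awk -F, '/[[:blank:]]play[[:blank:]]/{a[$2]++} END {n=0; for (i in a) n++; print n}' Dataset.txt
2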