So I'm creating an analytics bot for my EPQ that counts the number of times a specific hashtag is used. How would I go about checking whether a word in a string of other words contains a #?
CodePudding user response:
test = " if a word in a string of other words contains a #"
if "#" in test:
    print("yes")
CodePudding user response:
A first approach is to check whether each word contains the substring "#" using in, and to gather a count for each unique word using a dictionary:
texts = ["it's friday! #TGIF", "My favorite day! #TGIF"]
counts = {}
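for text in texts:
    for word in text.split(" "):
        # skip words that are not hashtags
        if "#" not in word:
            continue
        if word not in counts:
            counts[word] = 0
        # increment rather than overwrite, so repeated hashtags accumulate
        counts[word] += 1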
print(counts)
# {'#TGIF': 2}
This could be improved further by:
- using str.casefold() to normalize text with different casings
- using a regex to ignore certain characters, e.g. '#tgif!' should be parsed as '#tgif'
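Both improvements can be combined in a short sketch. The sample texts below are made up for illustration (they extend the example above with mixed casing and trailing punctuation), and the pattern `#\w+` is only one reasonable guess at what should count as a hashtag:

```python
import re
from collections import Counter

# Hypothetical sample data with mixed casing and trailing punctuation.
texts = ["it's friday! #TGIF", "My favorite day! #tgif!", "#Tgif, finally"]

# '#' followed by one or more word characters (letters, digits, underscore),
# so a trailing '!' or ',' is not part of the match.
HASHTAG = re.compile(r"#\w+")

counts = Counter()
for text in texts:
    for tag in HASHTAG.findall(text):
        # casefold() so '#TGIF', '#tgif' and '#Tgif' count as the same tag
        counts[tag.casefold()] += 1

print(counts)
# Counter({'#tgif': 3})
```

Counter is used here instead of a plain dict only to avoid the "if word not in counts" check; the dictionary version works the same way.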
CodePudding user response:
You already have a decent answer, so it really just comes down to what kind of data you want to end up with. Here's another solution, using Python's re module on the same data:
import re
texts = ["it's friday! #TGIF #foo", "My favorite day! #TGIF"]
finds = [re.findall(r'#(\w+)', text) for text in texts]
Regex takes some getting used to. The pattern '#(\w+)' 'captures' (with the parentheses) the 'word' (\w+, one or more word characters) after any hash character ('#'). It results in a list of hashtags for each 'document' in the dataset:
[['TGIF', 'foo'], ['TGIF']]
Then you could get the total counts with this trick:
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(finds))
Yielding this dictionary-like thing:
Counter({'TGIF': 2, 'foo': 1})
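Since the question is about one specific hashtag, the same regex can also be wrapped in a small helper. This is only a sketch: count_hashtag is a hypothetical name, the sample texts are made up, and matching case-insensitively is an assumption about what the bot should do:

```python
import re

def count_hashtag(texts, tag):
    """Count case-insensitive occurrences of one hashtag across all texts."""
    # Accept the tag with or without its leading '#'.
    wanted = tag.lstrip("#").casefold()
    total = 0
    for text in texts:
        # Each match is the word after a '#', without the '#' itself.
        for found in re.findall(r"#(\w+)", text):
            if found.casefold() == wanted:
                total += 1
    return total

# Hypothetical sample data.
texts = ["it's friday! #TGIF #foo", "My favorite day! #tgif"]
print(count_hashtag(texts, "#TGIF"))
# 2
```

Comparing the whole captured word means '#TGIFriday' is not counted as '#TGIF', which a plain substring check would get wrong.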