How would I isolate specific words containing a specific character?

Time:06-22

So I'm creating an analytics bot for my EPQ that counts the number of times a specific hashtag is used. How would I go about checking whether a word in a string of other words contains a #?

CodePudding user response:

test = " if a word in a string of other words contains a #"
if "#" in test:
    print("yes")

CodePudding user response:

A first approach is to check whether each word contains "#" using the in operator, and to keep a count for each unique hashtag in a dictionary:

texts = ["it's friday! #TGIF", "My favorite day! #TGIF"]
counts = {}

for text in texts:
    for word in text.split(" "):
        # skip words that are not hashtags
        if "#" not in word:
            continue
        # first time this hashtag is seen: start its count at zero
        if word not in counts:
            counts[word] = 0
        counts[word] += 1

print(counts)
# {'#TGIF': 2}

This could be improved further with:

  • using str.casefold() to normalize text with different casings
  • using regex to ignore certain chars, e.g. '#tgif!' should be parsed as '#tgif'
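
A minimal sketch of both improvements, assuming a hashtag consists only of letters, digits and underscores (what \w matches). The second sample text is changed here to a lowercase tag with trailing punctuation, purely for illustration:

```python
import re

texts = ["it's friday! #TGIF", "My favorite day! #tgif!"]
counts = {}

for text in texts:
    # \w+ stops at punctuation, so '#tgif!' is captured as '#tgif'
    for tag in re.findall(r"#\w+", text):
        # casefold() makes '#TGIF' and '#tgif' count as the same hashtag
        key = tag.casefold()
        counts[key] = counts.get(key, 0) + 1

print(counts)
# {'#tgif': 2}
```

Whether tags that differ only in case should be merged depends on what the bot is meant to report, so treat the casefold() step as optional.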

CodePudding user response:

You already have a decent answer, so it really just comes down to what kind of data you want to end up with. Here's another solution, using Python's re module on the same data:

import re

texts = ["it's friday! #TGIF #foo", "My favorite day! #TGIF"]

finds = [re.findall(r'#(\w+)', text) for text in texts]
print(finds)

Regex takes some getting used to. The pattern '#(\w+)' matches a hash character ('#') followed by one or more word characters (\w+), and the parentheses 'capture' just the word, without the '#'. It results in a list of hashtags for each 'document' in the dataset:

[['TGIF', 'foo'], ['TGIF']]

Then you could get the total counts with this trick:

from collections import Counter
from itertools import chain

Counter(chain.from_iterable(finds))

Yielding this dictionary-like thing:

Counter({'TGIF': 2, 'foo': 1})
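
Since the goal is to count how often one specific hashtag is used, here is a short sketch of how the Counter might be queried afterwards. The name totals and the lookup key "missing" are just illustrative choices:

```python
import re
from collections import Counter
from itertools import chain

texts = ["it's friday! #TGIF #foo", "My favorite day! #TGIF"]

# one list of captured hashtags per text, then flattened and counted
finds = [re.findall(r"#(\w+)", text) for text in texts]
totals = Counter(chain.from_iterable(finds))

print(totals["TGIF"])         # 2
print(totals["missing"])      # 0, a Counter returns 0 for unseen keys
print(totals.most_common(1))  # [('TGIF', 2)]
```

Note that the keys have no leading '#', because the capture group only keeps the word after it.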