How to extract specific words from pieces of text, using a dictionary of words in categories?

Time:12-11

I want to extract specific words from text in a data frame. I have put these words into lists in a dictionary, where each key is a category. From this I want to create one column per category that stores the matching words. As always, it's best illustrated by example:

I have a data frame:

import pandas as pd
df = pd.DataFrame({'Text': ["This car is fast, agile and large and wide", "This wagon is slow, sluggish, small and compact with alloy wheels"]})

Which creates the table:

    Text
0   This car is fast, agile and large and wide
1   This wagon is slow, sluggish, small and compact with alloy wheels

And a dictionary of words within categories I want to extract from them. The words are all natural language words without symbols and can include phrases, such as "alloy wheels" in this example (this doesn't have to be a dictionary, I just felt this was the best approach):

myDict = {
  "vehicle": ["car", "wagon"],
  "speed": ["fast", "agile", "slow", "sluggish"],
  "size": ["large", "small", "wide", "compact"],
  "feature": ["alloy wheels"]
}

From this I want to create a table that looks like this:

|     Text                                                          | vehicle | speed          | size           | feature      |
| ----------------------------------------------------------------- | ------- | -------------- | -------------- | ------------ |
| This car is fast, agile and large and wide                        | car     | fast, agile    | large, wide    | NaN          |
| This wagon is slow, sluggish, small and compact with alloy wheels | wagon   | slow, sluggish | small, compact | alloy wheels |

Cheers for the help in advance! Would love to use regex but any solutions welcome!

CodePudding user response:

There are many ways you could tackle this. One approach I'd maybe start with is: define a function which returns a list of words if they match your sentence.

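    import re

    def get_matching_words(sentence, category_dict, category):
        # Collect every word or phrase of the category that occurs in the sentence.
        matching_words = list()

        for word in category_dict[category]:
            # \b anchors let "fast," match "fast" and allow phrases like "alloy wheels".
            if re.search(r"\b" + re.escape(word) + r"\b", sentence):
                matching_words.append(word)

        return matching_words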

Then, you want to apply this function to your pandas dataframe.

    df["vehicle"] = df["Text"].apply(lambda x: get_matching_words(x, myDict, "vehicle"))

    df["speed"] = df["Text"].apply(lambda x: get_matching_words(x, myDict, "speed"))
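Writing one line per category gets repetitive; the same call can be run in a loop over the dictionary keys. A minimal sketch, assuming the `df` and `myDict` from the question and a whole-word regex matcher (one possible way to write the helper, not the only one):

```python
import re
import pandas as pd

df = pd.DataFrame({"Text": [
    "This car is fast, agile and large and wide",
    "This wagon is slow, sluggish, small and compact with alloy wheels",
]})

myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"],
}

def get_matching_words(sentence, category_dict, category):
    # Whole-word (or whole-phrase) match, so "fast," still counts as "fast".
    return [
        word for word in category_dict[category]
        if re.search(r"\b" + re.escape(word) + r"\b", sentence)
    ]

# One new column per dictionary key; c=category binds the current key
# to the lambda so each column uses its own category.
for category in myDict:
    df[category] = df["Text"].apply(
        lambda text, c=category: get_matching_words(text, myDict, c)
    )
```

Each new column holds a list of matches per row, in the order the words appear in the dictionary; rows with no match get an empty list.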

The only thing to add here would be to concatenate the list into a string, instead of returning a list.

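    import re

    def get_matching_words(sentence, category_dict, category):
        matching_words = list()

        for word in category_dict[category]:
            # Whole-word match: a bare `word in sentence` would also find "car" inside "scar".
            if re.search(r"\b" + re.escape(word) + r"\b", sentence):
                matching_words.append(word)

        # Join with ", " to get strings such as "fast, agile".
        return ", ".join(matching_words)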
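Since regex was mentioned in the question: a variant that lets pandas do the matching with `Series.str.findall`, building one alternation pattern per category. This is only a sketch, assuming the dictionary entries are plain words or phrases as in the question; rows without a match are set to NaN to mirror the desired table.

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({"Text": [
    "This car is fast, agile and large and wide",
    "This wagon is slow, sluggish, small and compact with alloy wheels",
]})

myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"],
}

for category, words in myDict.items():
    # Longest entries first, so a phrase is tried before a shorter word it contains.
    alternatives = sorted(words, key=len, reverse=True)
    # Non-capturing group wrapped in \b anchors: whole words or phrases only.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b"
    found = df["Text"].str.findall(pattern, flags=re.IGNORECASE)
    # Join the hits in order of appearance; leave NaN where nothing matched.
    df[category] = found.apply(lambda hits: ", ".join(hits) if hits else np.nan)
```

Here the matches come out in the order they appear in the text rather than in dictionary order, and because of `re.IGNORECASE` the stored value keeps the casing found in the text.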