How to extract specific words from pieces of text, using a dictionary of words in categories?

I want to extract specific words from text in a data frame. I have put these words into lists in a dictionary, keyed by category. From this I want to create one column per category that stores the matching words. As always, it's best illustrated by example:

I have a data frame:

import pandas as pd
df = pd.DataFrame({'Text': ["This car is fast, agile and large and wide", "This wagon is slow, sluggish, small and compact with alloy wheels"]})

Which creates the table:

    Text
0   This car is fast, agile and large and wide
1   This wagon is slow, sluggish, small and compact with alloy wheels

And a dictionary of words within categories that I want to extract from the text. The words are all natural-language words without symbols and can include phrases, such as "alloy wheels" in this example (this doesn't have to be a dictionary, I just felt this was the best approach):

myDict = {
  "vehicle": ["car", "wagon"],
  "speed": ["fast", "agile", "slow", "sluggish"],
  "size": ["large", "small", "wide", "compact"],
  "feature": ["alloy wheels"]
}

And from this I want to create a table that looks like this:

|     Text                                                          | vehicle | speed          | size           | feature      |
| ----------------------------------------------------------------- | ------- | -------------- | -------------- | ------------ |
| This car is fast, agile and large and wide                        | car     | fast, agile    | large, wide    | NaN          |
| This wagon is slow, sluggish, small and compact with alloy wheels | wagon   | slow, sluggish | small, compact | alloy wheels |

Cheers for the help in advance! Would love to use regex but any solutions welcome!

CodePudding user response：

There are many ways you could tackle this. One approach I'd maybe start with is: define a function which returns a list of words if they match your sentence.

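    def get_matching_words(sentence, category_dict, category):
        matching_words = []

        for word in category_dict[category]:
            # Substring check, so "fast," (with its comma) and phrases
            # such as "alloy wheels" are still found in the sentence
            if word in sentence:
                matching_words.append(word)

        return matching_words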

Then, you want to apply this function to your pandas dataframe.

    df["vehicle"] = df["Text"].apply(lambda x: get_matching_words(x, myDict, "vehicle"))

    df["speed"] = df["Text"].apply(lambda x: get_matching_words(x, myDict, "speed"))
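To fill every category without writing one line per key, the same call can sit inside a loop over the dictionary. This is only a sketch: it assumes the question's `df` and `myDict`, and a substring-based variant of `get_matching_words` (so it can also report false hits such as "car" inside "scarce").

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "This car is fast, agile and large and wide",
    "This wagon is slow, sluggish, small and compact with alloy wheels",
]})

myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"],
}

def get_matching_words(sentence, category_dict, category):
    # Substring check, so "fast," and the phrase "alloy wheels" both match.
    return [word for word in category_dict[category] if word in sentence]

# Add one new column per category key.
for category in myDict:
    df[category] = df["Text"].apply(
        lambda text: get_matching_words(text, myDict, category)
    )
```

Each new column holds a list of matches per row; rows with no match get an empty list.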

The only thing left to do is join the matches into a single string instead of returning a list:

    def get_matching_words(sentence, category_dict, category):
        matching_words = []

        for word in category_dict[category]:
            if word in sentence:
                matching_words.append(word)

        # Return e.g. "fast, agile" rather than ["fast", "agile"]
        return ", ".join(matching_words)
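Since the question mentions regex, here is a rough alternative using word boundaries, so that "car" is not picked up inside a longer word. The helper name `extract_categories` is made up for this sketch, and returning `None` when nothing matches is an assumption standing in for the `NaN` in the desired table.

```python
import re

def extract_categories(text, category_dict):
    """Return {category: "word, word"} for one text, or None per empty category."""
    result = {}
    for category, words in category_dict.items():
        # One alternation per category; \b restricts matches to whole words/phrases.
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        found = re.findall(pattern, text, flags=re.IGNORECASE)
        # Matches come back in the order they appear in the text.
        result[category] = ", ".join(found) if found else None
    return result

myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"],
}

row = extract_categories(
    "This wagon is slow, sluggish, small and compact with alloy wheels", myDict
)
print(row)
```

To get the columns in pandas, one option is to build a frame from these dictionaries and join it back, for example `df.join(pd.DataFrame([extract_categories(t, myDict) for t in df["Text"]], index=df.index))`.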