Home > OS >  Python: Extract Specific words/texts from a column
Python: Extract Specific words/texts from a column

Time:09-29

I am trying to pull some particular texts from another column in a dataset. The sample data is like this:

sample_data = pd.DataFrame({'Pet_Id': ['1', '2', '3', '4', '5'],
                            'Status': ['Sold', 'In store', 'Sold', 'In store', 'Sold'],
                            'Comment': ['6-12months - Dog -Labrador', 'Cat - BM-born 2019', 'Parrot - 12-24 months',
                                        'Over2Y - Dog', 'Cat']})

The output is as below:

  Pet_Id    Status                     Comment
0      1      Sold  6-12months - Dog -Labrador
1      2  In store          Cat - BM-born 2019
2      3      Sold       Parrot - 12-24 months
3      4  In store                Over2Y - Dog
4      5      Sold                         Cat

I wish to only get the animal type from the "Comment" column, in this case only "Dog", "Cat" or "Parrot" and put the values into a new column "Type", so the expected output is as below:

  Pet_Id    Status                     Comment    Type
0      1      Sold  6-12months - Dog -Labrador     Dog
1      2  In store          Cat - BM-born 2019     Cat
2      3      Sold       Parrot - 12-24 months  Parrot
3      4  In store                Over2Y - Dog     Dog
4      5      Sold                         Cat     Cat

I tried to use dict to map but not working well as there are quite a few records in the actual data, so would you please give some help? Many thanks

CodePudding user response:

Do you have a list of all possible pets? If so, you can use the apply method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to find if one of the pets appears in the string

pets = ["Dog", "Cat", "Parrot"]

def search_pet(x):
    for pet in pets:
        if pet in x["Comment"]:
            return pet

sample_data["Type"] = sample_data.apply(search_pet, axis=1)

CodePudding user response:

Use str.extractall:

animals = ['Dog', 'Cat', 'Parrot']
pattern = fr"({'|'.join(animals)})"
sample_data['Type'] = sample_data['Comment'].str.extractall(pattern).values

Output:

>>> sample_data
  Pet_Id    Status                     Comment    Type
0      1      Sold  6-12months - Dog -Labrador     Dog
1      2  In store          Cat - BM-born 2019     Cat
2      3      Sold       Parrot - 12-24 months  Parrot
3      4  In store                Over2Y - Dog     Dog
4      5      Sold                         Cat     Cat

>>> print(pattern)
(Dog|Cat|Parrot)

Build the regex: fr"({'|'.join(animals)})

  • f for f-strings and r for raw-strings
  • (...) is for regex to capture group
  • '|'.join(animals) create a string where each item are separated by '|'

The regex captures one animal of the list.

  • Related