I am trying to pull some particular texts from another column in a dataset. The sample data is like this:
sample_data = pd.DataFrame({'Pet_Id': ['1', '2', '3', '4', '5'],
'Status': ['Sold', 'In store', 'Sold', 'In store', 'Sold'],
'Comment': ['6-12months - Dog -Labrador', 'Cat - BM-born 2019', 'Parrot - 12-24 months',
'Over2Y - Dog', 'Cat']})
The output is as below:
Pet_Id Status Comment
0 1 Sold 6-12months - Dog -Labrador
1 2 In store Cat - BM-born 2019
2 3 Sold Parrot - 12-24 months
3 4 In store Over2Y - Dog
4 5 Sold Cat
I wish to only get the animal type from the "Comment" column, in this case only "Dog", "Cat" or "Parrot" and put the values into a new column "Type", so the expected output is as below:
Pet_Id Status Comment Type
0 1 Sold 6-12months - Dog -Labrador Dog
1 2 In store Cat - BM-born 2019 Cat
2 3 Sold Parrot - 12-24 months Parrot
3 4 In store Over2Y - Dog Dog
4 5 Sold Cat Cat
I tried to use dict to map but not working well as there are quite a few records in the actual data, so would you please give some help? Many thanks
CodePudding user response:
Do you have a list of all possible pets? If so, you can use the apply
method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to find if one of the pets appears in the string
pets = ["Dog", "Cat", "Parrot"]
def search_pet(x):
for pet in pets:
if pet in x["Comment"]:
return pet
sample_data["Type"] = sample_data.apply(search_pet, axis=1)
CodePudding user response:
Use str.extractall
:
animals = ['Dog', 'Cat', 'Parrot']
pattern = fr"({'|'.join(animals)})"
sample_data['Type'] = sample_data['Comment'].str.extractall(pattern).values
Output:
>>> sample_data
Pet_Id Status Comment Type
0 1 Sold 6-12months - Dog -Labrador Dog
1 2 In store Cat - BM-born 2019 Cat
2 3 Sold Parrot - 12-24 months Parrot
3 4 In store Over2Y - Dog Dog
4 5 Sold Cat Cat
>>> print(pattern)
(Dog|Cat|Parrot)
Build the regex: fr"({'|'.join(animals)})
f
for f-strings andr
for raw-strings(...)
is for regex to capture group- '|'.join(animals) create a string where each item are separated by '|'
The regex captures one animal of the list.