I am trying to create a set of columns from a list taking a string from another column.
I have found a temporary solution in this post but it only creates one column if, for example, I have in String1 "I have a dog and a cat".
In [7]: df["animal"] = df["String1"].map(lambda s: next((animal for animal in search_list if animal in s), "other"))
...:
In [8]: df
Out[8]:
weight String1 animal
0 70 Labrador is a dog dog
1 10 Abyssinian is a cat cat
2 65 German Shepard is a dog dog
3 1 pigeon is a bird other
How could I create two columns, like ['animal_1'] and ['animal_2'] to have both "dog" (in ['animal_1']) and "cat" in ['animal_2']?
Desired output would be like below:
weight String1 animal_1 animal_2
0 70 Labrador is a dog dog
1 10 Abyssinian is a cat cat
2 65 German Shepard is a dog dog
3 1 pigeon is a bird other
4 30 I have a dog and a cat dog cat
CodePudding user response:
You can use:
animals = ['dog', 'cat']
regex = '|'.join(animals)
out = (df.join(
df['String1'].str.extractall(fr'\b({regex})\b')[0].unstack()
.rename(columns=lambda x: f'animal_{x 1}')
)
.fillna({'animal_1': 'other'})
)
Output:
weight String1 animal_1 animal_2
0 70 Labrador is a dog dog NaN
1 10 Abyssinian is a cat cat NaN
2 65 German Shepard is a dog dog NaN
3 1 pigeon is a bird other NaN
4 30 I have a dog and a cat dog cat
CodePudding user response:
It's a good idea to compile the regex at the beginning and use the compiled regex in the loop.
import re
import pandas as pd
ANIMALS = {"dog", "cat"}
PATTERN = re.compile("|".join(rf"\b{x}\b" for x in ANIMALS))
data = {"String1": ["Labrador is a dog", "Abyssinian is a cat", "German Shepard is a dog", "pigeon is a bird", "I have a dog and a cat"]}
df = pd.DataFrame(data)
for ix, item in df["String1"].items():
for i, animal in enumerate(pattern.findall(item)):
df.loc[ix, f"animal_{i 1}"] = animal
df.fillna({"animal_1": "other"}, inplace=True)