Create columns from strings that are in a list-CodePudding

I am trying to create a set of columns from a list taking a string from another column.

I have found a temporary solution in this post but it only creates one column if, for example, I have in String1 "I have a dog and a cat".

In [7]: df["animal"] = df["String1"].map(lambda s: next((animal for animal in search_list if animal in s), "other"))
   ...:

In [8]: df
Out[8]:
   weight                  String1 animal
0      70        Labrador is a dog    dog
1      10      Abyssinian is a cat    cat
2      65  German Shepard is a dog    dog
3       1         pigeon is a bird  other

How could I create two columns, like ['animal_1'] and ['animal_2'] to have both "dog" (in ['animal_1']) and "cat" in ['animal_2']?

Desired output would be like below:

   weight                  String1 animal_1 animal_2
0      70        Labrador is a dog    dog
1      10      Abyssinian is a cat    cat
2      65  German Shepard is a dog    dog
3       1         pigeon is a bird  other
4      30   I have a dog and a cat    dog   cat

CodePudding user response：

You can use:

animals = ['dog', 'cat']
regex = '|'.join(animals)

out = (df.join(
         df['String1'].str.extractall(fr'\b({regex})\b')[0].unstack()
           .rename(columns=lambda x: f'animal_{x 1}')
        )
          .fillna({'animal_1': 'other'})
     )

Output:

   weight                  String1 animal_1 animal_2
0      70        Labrador is a dog      dog      NaN
1      10      Abyssinian is a cat      cat      NaN
2      65  German Shepard is a dog      dog      NaN
3       1         pigeon is a bird    other      NaN
4      30   I have a dog and a cat      dog      cat

CodePudding user response：

It's a good idea to compile the regex at the beginning and use the compiled regex in the loop.

import re
import pandas as pd

ANIMALS = {"dog", "cat"}
PATTERN = re.compile("|".join(rf"\b{x}\b" for x in ANIMALS))

data = {"String1": ["Labrador is a dog", "Abyssinian is a cat", "German Shepard is a dog", "pigeon is a bird", "I have a dog and a cat"]}
df = pd.DataFrame(data)

for ix, item in df["String1"].items():
    for i, animal in enumerate(pattern.findall(item)):
        df.loc[ix, f"animal_{i 1}"] = animal
df.fillna({"animal_1": "other"}, inplace=True)