Create new indicator columns based on values in another column

Time:12-18

I have some data that looks like this:

import pandas as pd

fruits = ['apple', 'pear', 'peach']

df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})

print(df.head())

                              col1
0                  i want an apple
1                     i hate pears
2  please buy a peach and an apple
3                    I want squash

I need a solution that creates a column for each item in fruits and gives a 1 or 0 value indicating whether or not col1 contains that item. Ideally, the output will look like this:

goal_df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
                        'apple': [1, 0, 1, 0],
                        'pear': [0, 1, 0, 0],
                        'peach': [0, 0, 1, 0]})

print(goal_df.head())


                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0

I tried this but it did not work:

for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0

CodePudding user response:

items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)

Output:

>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
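For context, the loop in the question fails because `df['col1'].str.contains(i)` returns a boolean Series with one value per row, and an `if` statement cannot reduce a Series to a single True/False, so pandas raises a ValueError. The sketch below reproduces that on the question's data and then assigns the whole Series at once, as in the answer above. The variable names `mask` and `raised` are only illustrative.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# str.contains returns one boolean per row, not a single boolean.
mask = df['col1'].str.contains('apple')

# Using a Series as an `if` condition is ambiguous, so pandas raises ValueError.
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

# Assign the whole boolean Series per fruit and cast True/False to 1/0.
for fruit in fruits:
    df[fruit] = df['col1'].str.contains(fruit, case=False).astype(int)

print(df)
```

Note that `str.contains` matches substrings, so 'pear' also matches 'pears', which is what the expected output in the question asks for.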

CodePudding user response:

Use str.extractall to extract the words, then pd.crosstab:

pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0)
             )

Output:

                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0

CodePudding user response:

Try:

  1. Get all matching fruits using str.extractall
  2. Use pd.get_dummies to get indicator values
  3. Join the result to the original DataFrame

matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1, 0))
output = df.join(matches.groupby(level=0).sum()).fillna(0)

>>> output
                              col1  apple  peach  pear
0                  i want an apple    1.0    0.0   0.0
1                     i hate pears    0.0    0.0   1.0
2  please buy a peach and an apple    1.0    1.0   0.0
3                    I want squash    0.0    0.0   0.0
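The output above has float columns in alphabetical order: the join introduces NaN for rows without a match, and get_dummies sorts its columns. If integer columns in the order of `fruits` are wanted, one possible follow-up is to reindex before joining. This is a sketch rather than part of the answer above; the names `found` and `indicators` are hypothetical, and the `clip` step is an added assumption that a fruit mentioned twice should still count as 1.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# One row per match, indexed by the original row label.
found = df['col1'].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1)

# Count matches per original row; rows without any match are absent here.
matches = pd.get_dummies(found).groupby(level=0).sum()

# Restore missing rows/columns as 0, cap counts at 1, and keep integer dtype.
indicators = (matches.reindex(index=df.index, columns=fruits, fill_value=0)
                     .clip(upper=1)
                     .astype(int))

output = df.join(indicators)
print(output)
```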

CodePudding user response:

You can use the function below for the apple column and do the same for the others:

def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0
df['apple'] = df['col1'].apply(has_apple)
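To avoid writing one function per fruit, the same idea can be generalized with a single helper that takes the word as a parameter. This is a minimal sketch; `has_word` is a hypothetical name, not something from the answer above.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

def has_word(text, word):
    """Hypothetical helper: return 1 if `word` occurs in `text`, ignoring case."""
    return int(word.lower() in text.lower())

for fruit in fruits:
    # Extra keyword arguments given to apply are forwarded to the helper.
    df[fruit] = df['col1'].apply(has_word, word=fruit)

print(df)
```

On large frames the vectorized `str.contains` approach is usually preferable, since `apply` calls the Python function once per row.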

CodePudding user response:

I thought of another, completely different one-liner:

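# get_dummies returns its columns in alphabetical order (apple, peach, pear),
# so reindex to the order of items before assigning; otherwise pear and peach are swapped.
df[items] = df['col1'].str.findall('|'.join(items)).str.join('|').str.get_dummies('|').reindex(columns=items, fill_value=0)

Output:

>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0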

CodePudding user response:

Try using np.where from the numpy library:

import numpy as np

fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)