Compare each string element in a dataframe to a list and assign it to a column, python pandas-CodePudding

How to rearrange my dataframe according to column names while searching for specific strings in cells?

My dataframe:

0	1	2	3	4
apple pie	banana bread	orange juice	nan	nan
apple cookies	orange lemonade	nan	nan	nan
banana muffin	orange ice	berry candy	nan	nan
berry juice	nan	nan	nan	nan

I want to arrange the rows according to a list of column names, which look for specific strings of text.

apple	banana	orange	berry	lemon
apple pie	banana bread	orange juice	nan	nan
apple cookies	nan	orange lemonade	nan	nan
nan	banana muffin	orange ice	berry candy	nan
nan	nan	nan	berry juice	nan

I have tried to create a column/list for each fruit, searching for the right string and adding the cell if it matches, however I do not know how to iterate through the dataframe and assign values. I just get a column of Nan's.

col_names = ['apple', 'banana', 'orange', 'berry', 'lemonade']
apples = np.where(df_fruits.str.contains("apple", case=False, na=False), df_fruits, np.nan)
bananas = np.where(df_fruits.str.contains("banana", case=False, na=False), df_fruits, np.nan)
etc...

Edit: I got the dataframe from a csv-file, so the original data format is in rows of string: "apple pie, banana bread, orange juice, nan, nan" etc.

CodePudding user response：

we can do some re-shaping using .unstack and .str.extractall

pat = '|'.join(col_names)

s = df.stack()

s1 = s.to_frame('vals').join(
      s.str.extractall(f'({pat})').groupby(level=[0,1]).agg(list))


out = s1.explode(0).set_index(0,append=True).reset_index(1,drop=True).unstack(-1)

print(out)

            vals
0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN

# if you want to drop the level on the multi index.
out.columns = out.columns.droplevel(None)

0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN

CodePudding user response：

Try this:

list_values=[item for value in df_fruits.values for item in value]
list_series=[] 
for col in col_names:
   locals()[col "series"]=pd.Series(map(lambda x:x*(col in str(x)),list_values)
   list_series.append(eval(col "series"))

the first row is the get all your dataframe colums values into a list next we create a pandas series for every fruit type and append it into a list after we create a new data frame

new_df=pd.concat(list_series,axis=1)