Home > database >  Compare each string element in a dataframe to a list and assign it to a column, python pandas
Compare each string element in a dataframe to a list and assign it to a column, python pandas

Time:08-06

How to rearrange my dataframe according to column names while searching for specific strings in cells?

My dataframe:

0 1 2 3 4
apple pie banana bread orange juice nan nan
apple cookies orange lemonade nan nan nan
banana muffin orange ice berry candy nan nan
berry juice nan nan nan nan

I want to arrange the rows according to a list of column names, which look for specific strings of text.

apple banana orange berry lemon
apple pie banana bread orange juice nan nan
apple cookies nan orange lemonade nan nan
nan banana muffin orange ice berry candy nan
nan nan nan berry juice nan

I have tried to create a column/list for each fruit, searching for the right string and adding the cell if it matches, however I do not know how to iterate through the dataframe and assign values. I just get a column of Nan's.

col_names = ['apple', 'banana', 'orange', 'berry', 'lemonade']
apples = np.where(df_fruits.str.contains("apple", case=False, na=False), df_fruits, np.nan)
bananas = np.where(df_fruits.str.contains("banana", case=False, na=False), df_fruits, np.nan)
etc...

Edit: I got the dataframe from a csv-file, so the original data format is in rows of string: "apple pie, banana bread, orange juice, nan, nan" etc.

CodePudding user response:

we can do some re-shaping using .unstack and .str.extractall

pat = '|'.join(col_names)

s = df.stack()

s1 = s.to_frame('vals').join(
      s.str.extractall(f'({pat})').groupby(level=[0,1]).agg(list))


out = s1.explode(0).set_index(0,append=True).reset_index(1,drop=True).unstack(-1)

print(out)

            vals
0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN

# if you want to drop the level on the multi index.
out.columns = out.columns.droplevel(None)

0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN

CodePudding user response:

Try this:

list_values=[item for value in df_fruits.values for item in value]
list_series=[] 
for col in col_names:
   locals()[col "series"]=pd.Series(map(lambda x:x*(col in str(x)),list_values)
   list_series.append(eval(col "series"))

the first row is the get all your dataframe colums values into a list next we create a pandas series for every fruit type and append it into a list after we create a new data frame

new_df=pd.concat(list_series,axis=1)

  • Related