What's the best and simplest way to look extract the fruit_name and fruit_colour from a dataframe of one messy column? Each fruit is listed in order but there can be irrelevant detail in between
df
col
1 fruit_name
2 apple
3 fruit_colour
4 green
5 fruit_number
6 10
7 no information
...
10 fruit_name
11 banana
12 irrelevant_nonsense
13 fruit_colour
14 yellow
14 genearl_fin
15 fruit_name
...
Expected Output
fruit_name fruit_colour
0 apple green
1 banana yellow
CodePudding user response:
A funky way is to use .shift
…
pd.DataFrame({"fruit_name": df[df.shift(1).eq('fruit_name').values].iloc[:, 0].to_list(),
"fruit_colour":df[df.shift(1).eq('fruit_colour').values].iloc[:,0].to_list()})
## Output
# fruit_name fruit_colour
# 0 apple green
# 1 banana yellow
I don’t think it is effective. You will also have issues if there are uneven values ;)
CodePudding user response:
We can do mask
and ffill
cumcount
to create the hidden pivot
index and column
df['col1'] = df.col.where(df.col.str.startswith('fruit')).ffill()
df = df.loc[df.col!=df.col1]
out = df.assign(key=df.groupby('col1').cumcount()).pivot(index='key',columns='col1',values='col')
out
col1 fruit_colour fruit_name fruit_number
key
0 green apple 10
1 yellow banana noinformation
2 genearl_fin irrelevant_nonsense NaN
CodePudding user response:
You can extract the values of fruit_names
and fruit_colours
by filtering df.shift(-1)
, then combine the series in a new df:
import pandas as pd
df = pd.DataFrame({'col': ['fruit_name', 'apple', 'fruit_colour', 'green','fruit_number',10,'no information','fruit_name','banana','irrelevant_nonsense','fruit_colour','yellow','genearl_fin','fruit_name'], })
fruit_names = df['col'].shift(-1)[df['col'] == 'fruit_name']
fruit_colours = df['col'].shift(-1)[df['col'] == 'fruit_colour']
new_df = pd.DataFrame(zip(fruit_names, fruit_colours), columns=['fruit_names', 'fruit_colours'])
output:
fruit_names | fruit_colours | |
---|---|---|
0 | apple | green |
1 | banana | yellow |