Home > Software engineering >  Extract Data from 1 column
Extract Data from 1 column

Time:10-04

What's the best and simplest way to look extract the fruit_name and fruit_colour from a dataframe of one messy column? Each fruit is listed in order but there can be irrelevant detail in between

df

    col
1   fruit_name
2   apple
3   fruit_colour
4   green
5   fruit_number
6   10
7   no information
...
10  fruit_name
11  banana
12  irrelevant_nonsense
13  fruit_colour
14  yellow
14  genearl_fin
15  fruit_name
...

Expected Output

    fruit_name   fruit_colour   
0   apple         green
1   banana        yellow

CodePudding user response:

A funky way is to use .shift

…

pd.DataFrame({"fruit_name": df[df.shift(1).eq('fruit_name').values].iloc[:, 0].to_list(),
"fruit_colour":df[df.shift(1).eq('fruit_colour').values].iloc[:,0].to_list()})


## Output
#   fruit_name  fruit_colour
# 0 apple   green
# 1 banana  yellow

I don’t think it is effective. You will also have issues if there are uneven values ;)

CodePudding user response:

We can do mask and ffill cumcount to create the hidden pivot index and column

df['col1'] = df.col.where(df.col.str.startswith('fruit')).ffill()
df = df.loc[df.col!=df.col1]
out = df.assign(key=df.groupby('col1').cumcount()).pivot(index='key',columns='col1',values='col')

out
col1 fruit_colour           fruit_name   fruit_number
key
0           green                apple             10
1          yellow               banana  noinformation
2     genearl_fin  irrelevant_nonsense            NaN

CodePudding user response:

You can extract the values of fruit_names and fruit_colours by filtering df.shift(-1), then combine the series in a new df:

import pandas as pd

df = pd.DataFrame({'col': ['fruit_name', 'apple', 'fruit_colour', 'green','fruit_number',10,'no information','fruit_name','banana','irrelevant_nonsense','fruit_colour','yellow','genearl_fin','fruit_name'], })

fruit_names = df['col'].shift(-1)[df['col'] == 'fruit_name']
fruit_colours = df['col'].shift(-1)[df['col'] == 'fruit_colour']

new_df = pd.DataFrame(zip(fruit_names, fruit_colours), columns=['fruit_names', 'fruit_colours'])

output:

fruit_names fruit_colours
0 apple green
1 banana yellow
  • Related