Home > Software design >  How can I use groupby with multiple values in a column in pandas?
How can I use groupby with multiple values in a column in pandas?

Time:02-23

I've a dataframe like as follows,

import pandas as pd

data = {
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}

df = pd.DataFrame(data)
print(df)

      brand model  year   price
0  Mercedes     X  2011     NaN
1   Renault     Y  2010  1000.4
2      Ford     Z  2009  2000.3
3  Mercedes     X  2010  1000.0
4  Mercedes     X  2012  1100.3
5  Mercedes     X  2020  3000.5
6   Renault     Q  2011     NaN

I want to fill the missing values with the average of the observations containing year-1, year and year 1 and also same brand and model. For instance, Mercedes X model has a null price in 2011. When I look at the data,

2011 - 1 = 2010
2011   1 = 2012

The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3

The mean -> (1000.0   1100.3) / 2 = 1050.15

I've tried something as follows,

for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2]   1, c_key[2], c_key[2] - 1]))
    )
    
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
        
    mean_val = df[fc]['price'].mean()

    df.loc[sc, 'price'] = mean_val

print(df)

      brand model  year    price
0  Mercedes     X  2011  1050.15
1   Renault     Y  2010  1000.40
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN

But this solution takes a long time for 90,000 rows and 27 columns so, is there a more effective solution? For instance, can I use groupby for the values year-1, year, year 1, brand and model?

Thanks in advance.

CodePudding user response:

I think actually a more efficient way would be to sort by Brand and then Year, and then use interpolate:

df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))

Output:

>>> df
      brand model  year    price
0  Mercedes     X  2011  1050.15
1   Renault     Y  2010  1000.40
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011  1000.40

That also handles all the columns.

  • Related