Home > other >  In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?
In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?

Time:12-19

I've looked into agg/apply/transform after groupby, but none of them seem to meet my need. Here is an example DF:

df_seq = pd.DataFrame({
    'person':['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day':[1,2,3,1,4,6],
    'food':['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})

person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison

The day column shows that, for each person, he/she consumes food in sequential orders.

Now I would like to group by the person col, and create a DataFrame which contains food pairs for two neighboring days/time (as shown below).

Note the day column is only for example purpose here so the values of it should not be used. It only means the food column is in sequential order. In my real data, it's a datetime column.

person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison

At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.

Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?

CodePudding user response:

Create new column by DataFrameGroupBy.shift and then remove rows with missing values in food_next by DataFrame.dropna:

df = (df_seq.assign(food_next = df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print (df)
  person  day  food food_next
0    Tom    1  beef      lamb
1    Tom    2  lamb   chicken
3   Lucy    1  fish      pork
4   Lucy    4  pork   venison

CodePudding user response:

This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.

First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:

def get_next_food(df, person, day):
    results = df.query(f"`person`=='{person}' and `day`>{day}")
    if len(results)>0:
        return results.iloc[0]['food']
    else:
        return "Mystery"

You can use this as follows:

get_food(df_seq,"Tom", 1)

> 'lamb'

Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:

df_seq['next_food']=df_seq.apply(lambda x : get_food(df_seq, x['person'], x['day']), axis=1)

>
  person  day     food next_food
0    Tom    1     beef      lamb
1    Tom    2     lamb   chicken
2    Tom    3  chicken   Mystery
3   Lucy    1     fish      pork
4   Lucy    4     pork   venison
5   Lucy    6  venison   Mystery

Give it a try, I'm not convinced you'll see a vast performance improvement, but it'd be interesting to find out.

  • Related