Using diff() with groupby in pandas

I run into an error each time I try to compute the difference between meter readings in my dataset. The dataset is structured like this:

id  paymenttermid   houseid     houseid-meterid     quantity    month   year    cleaned_quantity
Datetime                                
2019-02-01  255     water   215     215M201     23.0    2   2019    23.0
2019-02-01  286     water   193     193M181     24.0    2   2019    24.0
2019-02-01  322     water   172     172M162     22.0    2   2019    22.0
2019-02-01  323     water   176     176M166     61.0    2   2019    61.0
2019-02-01  332     water   158     158M148     15.0    2   2019    15.0

I am trying to create a new column called consumption that holds the difference in quantity consumed for each house (identified by houseid-meterid) from one month of the year to the next.

The code I am using to implement this is:

water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)

After running this code, the consumption column contains only NaN values. How can I implement this logic correctly? The current result looks like this:

id  paymenttermid   houseid     houseid-meterid     quantity    month   year    cleaned_quantity    consumption
Datetime                                    
2019-02-01  255     water   215     215M201     23.0    2   2019    23.0    NaN
2019-02-01  286     water   193     193M181     24.0    2   2019    24.0    NaN
2019-02-01  322     water   172     172M162     22.0    2   2019    22.0    NaN
2019-02-01  323     water   176     176M166     61.0    2   2019    61.0    NaN
2019-02-01  332     water   158     158M148     15.0    2   2019    15.0    NaN

Many thanks in advance.

I have tried

water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)

and

water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)

and

water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()

All of these commands produce the same result as described above.

The expected output is:

Datetime      houseid-meterid cleaned_quantity  consumption
2019-02-01    215M201         23.0              20
2019-03-02    215M201         43.0              9
2019-04-01    215M201         52.0              12
2019-05-01    215M201         64.0              36
2019-06-01    215M201         100.0             20

What steps should I take?

CodePudding user response：

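Grouping by year, month, and houseid-meterid most likely leaves a single row in each group, so diff() has nothing to subtract and returns NaN. Instead, sort the values by Datetime (if needed), group by houseid-meterid only, compute the diff of the cleaned_quantity values, then shift the result up one row so that each difference lines up with the earlier reading: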

df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
print(df)

# Output
    Datetime houseid-meterid  cleaned_quantity  consumption
0 2019-02-01         215M201              23.0         20.0
1 2019-03-02         215M201              43.0          9.0
2 2019-04-01         215M201              52.0         12.0
3 2019-05-01         215M201              64.0         36.0
4 2019-06-01         215M201             100.0          NaN
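
For anyone who wants to try this with Datetime as the index, as in the question, here is a minimal self-contained sketch. The sample readings for the second meter are made up for illustration, and `readings` and `next_reading` are hypothetical names. It first checks that the original three-key grouping yields only NaN, then computes the consumption per meter with `shift(-1)`, which should be equivalent to `diff().shift(-1)` inside each group. Calling `reset_index()` is an assumption made here so that every row gets a unique label before the new column is assigned.

```python
import pandas as pd

# Hypothetical sample: two meters with three monthly readings each,
# indexed by Datetime as in the question.
water_df = pd.DataFrame(
    {
        "houseid-meterid": ["215M201", "193M181"] * 3,
        "month": [2, 2, 3, 3, 4, 4],
        "year": [2019] * 6,
        "cleaned_quantity": [23.0, 24.0, 43.0, 30.0, 52.0, 41.0],
    },
    index=pd.DatetimeIndex(
        ["2019-02-01", "2019-02-01", "2019-03-02",
         "2019-03-02", "2019-04-01", "2019-04-01"],
        name="Datetime",
    ),
)

# Original grouping: each (year, month, meter) group holds a single row,
# so diff() has no neighbouring row to subtract.
per_month = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
print(per_month.isna().all())

# Move Datetime into a column and order each meter's readings by date.
readings = water_df.reset_index().sort_values(["houseid-meterid", "Datetime"])

# Next reading of the same meter minus the current reading.
next_reading = readings.groupby("houseid-meterid")["cleaned_quantity"].shift(-1)
readings["consumption"] = next_reading - readings["cleaned_quantity"]
print(readings[["Datetime", "houseid-meterid", "cleaned_quantity", "consumption"]])
```

The last reading of each meter has no following month, so its consumption stays NaN, just as in the output above.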