I have a table like the one below. Each row contains the temperature in a city and the date. A given date can be duplicated, but for each city the temperature on a given day is the same. I want a new column with the change in temperature from the day before. For example, for 2 January 2019 (rows 5-8), the change in temperature for city 1 is 5° (20-15), and for city 2 the change is 1° (19-18).
I've tried pandas grouping, transform and merge operations, but cannot get it to work. A for loop works, of course, but it's quite slow. I would also like additional columns with temperature changes over more than one day.
Index | Date | City | Temp | Temp Diff 1 Day |
---|---|---|---|---|
1 | 01/01/2019 | 1 | 15 | na |
2 | 01/01/2019 | 1 | 15 | na |
3 | 01/01/2019 | 2 | 18 | na |
4 | 01/01/2019 | 3 | 10 | na |
5 | 02/01/2019 | 1 | 20 | 5 (20-15) |
6 | 02/01/2019 | 2 | 19 | 1 (19-18) |
7 | 02/01/2019 | 2 | 19 | 1 (19-18) |
8 | 02/01/2019 | 2 | 19 | 1 (19-18) |
9 | 03/01/2019 | 3 | 22 | na (nothing 2 Jan) |
10 | 03/01/2019 | 1 | 22 | 2 (22-20) |
Edit: I'm sorry, I didn't mention that a city can have no data for a given date. I've inserted a row for city 3 on the first day (1 Jan) and another on 3 Jan. Because city 3 has nothing on 2 Jan, its row on 3 Jan should report NA.
CodePudding user response:
One way to achieve this is as follows:
import pandas as pd
data = {'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
'Date': {0: '01/01/2019', 1: '01/01/2019', 2: '01/01/2019',
3: '02/01/2019', 4: '02/01/2019', 5: '02/01/2019',
6: '02/01/2019', 7: '03/01/2019', 8: '03/01/2019'},
'City': {0: 1, 1: 1, 2: 2, 3: 1, 4: 2, 5: 2, 6: 3, 7: 1, 8: 1},
'Temp': {0: 15, 1: 15, 2: 18, 3: 20, 4: 19, 5: 19, 6: 25, 7: 22, 8: 22},
'Temp Diff 1 Day': {0: 'na', 1: 'na', 2: 'na', 3: '5 (20-15)',
4: '1 (19-18)', 5: '1 (19-18)', 6: 'na',
7: '2 (22-20)', 8: '2 (22-20)'}}
df = pd.DataFrame(data)
# mark duplicate (Date, City) rows, keeping the first occurrence unmarked
dupl_keep_first = df.duplicated(subset=['Date', 'City'], keep='first')
# mark every member of a duplicated (Date, City) group, including the first
dupl_keep_false = df.duplicated(subset=['Date', 'City'], keep=False)
# diff per city on all rows *except* the repeats (N.B.: ~ operator "flips" True/False)
df['Temp Diff 1 Day'] = df.loc[~dupl_keep_first].groupby('City')['Temp'].diff()
# now select *all* duplicates and ffill within each (Date, City) group, so the
# diff is copied onto the repeated rows without leaking across groups
df.loc[dupl_keep_false, 'Temp Diff 1 Day'] = (
    df.loc[dupl_keep_false]
    .groupby(['Date', 'City'])['Temp Diff 1 Day']
    .ffill()
)
print(df)
Index Date City Temp Temp Diff 1 Day
0 1 01/01/2019 1 15 NaN
1 2 01/01/2019 1 15 NaN
2 3 01/01/2019 2 18 NaN
3 4 02/01/2019 1 20 5.0
4 5 02/01/2019 2 19 1.0
5 6 02/01/2019 2 19 1.0
6 7 02/01/2019 3 25 NaN
7 8 03/01/2019 1 22 2.0
8 9 03/01/2019 1 22 2.0
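Note that `diff()` compares against the previous *available* date for that city, not the previous calendar day, so it would not report NA for city 3 on 3 Jan when 2 Jan is missing. One way to respect calendar gaps, and to generalize to the n-day columns the question asks for, is a self-merge on a shifted date. This is a sketch of my own, not the code above; the helper name `calendar_diff` and the parameter `n` are mine:

```python
import pandas as pd

def calendar_diff(df, n=1):
    """Temp change vs. the date exactly n days earlier; NaN if that date is missing."""
    d = df.copy()
    d['Date'] = pd.to_datetime(d['Date'], dayfirst=True)
    # keep one row per (City, Date) so the self-merge is unambiguous
    daily = d.drop_duplicates(subset=['City', 'Date'])[['City', 'Date', 'Temp']]
    # shift each daily reading forward n days so it lines up with the later date
    prev = daily.assign(Date=daily['Date'] + pd.Timedelta(days=n))
    merged = d.merge(prev, on=['City', 'Date'], how='left', suffixes=('', '_prev'))
    # rows with no reading n days earlier get NaN automatically from the left join
    return (merged['Temp'] - merged['Temp_prev']).to_numpy()

data = {'Date': ['01/01/2019'] * 3 + ['02/01/2019'] * 3 + ['03/01/2019'],
        'City': [1, 1, 2, 1, 2, 2, 1],
        'Temp': [15, 15, 18, 20, 19, 19, 22]}
df = pd.DataFrame(data)
df['Temp Diff 1 Day'] = calendar_diff(df, n=1)
df['Temp Diff 2 Day'] = calendar_diff(df, n=2)
print(df)
```

Because the join key is the exact shifted date, a city with no reading on the previous day naturally gets NaN, and any `n` gives an extra column.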
CodePudding user response:
# Group by city, then diff after dropping duplicates within the group.
# (N.B.: drop_duplicates here de-duplicates on the Temp *values*, so this
# assumes a city never repeats an earlier temperature on a later date.)
df['Temp_diff'] = df.groupby('City')['Temp'].transform(lambda x: x.drop_duplicates().diff())
# Then group by city and date, and ffill the diff onto the duplicated rows.
df['Temp_diff'] = df.groupby(['City', 'Date'])['Temp_diff'].ffill()
print(df)
Output:
   Index        Date  City  Temp  Temp_diff
0      1  01/01/2019     1    15        NaN
1      2  01/01/2019     1    15        NaN
2      3  01/01/2019     2    18        NaN
3      4  02/01/2019     1    20        5.0
4      5  02/01/2019     2    19        1.0
5      6  02/01/2019     2    19        1.0
6      7  02/01/2019     3    25        NaN
7      8  03/01/2019     1    22        2.0
8      9  03/01/2019     1    22        2.0
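If a city could record the same temperature on two different days, the value-based `drop_duplicates()` inside the transform would silently drop the later day. A variant of my own that de-duplicates on (City, Date) instead, then merges the diff back onto the repeated rows, avoids that edge case:

```python
import pandas as pd

# sample where 3 Jan repeats 1 Jan's temperature for the same city
df = pd.DataFrame({
    'Date': ['01/01/2019', '02/01/2019', '02/01/2019', '03/01/2019'],
    'City': [1, 1, 1, 1],
    'Temp': [15, 20, 20, 15],
})

# diff one row per (City, Date), so repeated *values* across days are kept
daily = df.drop_duplicates(subset=['City', 'Date']).copy()
daily['Temp_diff'] = daily.groupby('City')['Temp'].diff()
# broadcast the diff back to every duplicate row via a left join
df = df.merge(daily[['City', 'Date', 'Temp_diff']], on=['City', 'Date'], how='left')
print(df)
```

Here 3 Jan correctly gets -5 (15-20), which the value-based de-duplication would have lost.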