I have a table like the one below. Each row contains the temperature in a city and the date. A given date can be duplicated, but for each city the temperature on a given day is the same. I want a new column with the change in temperature from the day before. For example, for 2 January 2019 (rows 5-8), the change in temperature for city 1 is 5° (20-15), and for city 2 the change is 1° (19-18).
I've tried pandas grouping, transform and merge operations, but cannot get it to work. A for loop works, of course, but it's quite slow. I would also like additional columns with temperature changes over more than one day.
Index | Date | City | Temp | Temp Diff 1 Day |
---|---|---|---|---|
1 | 01/01/2019 | 1 | 15 | na |
2 | 01/01/2019 | 1 | 15 | na |
3 | 01/01/2019 | 2 | 18 | na |
4 | 01/01/2019 | 3 | 10 | na |
5 | 02/01/2019 | 1 | 20 | 5 (20-15) |
6 | 02/01/2019 | 2 | 19 | 1 (19-18) |
7 | 02/01/2019 | 2 | 19 | 1 (19-18) |
8 | 02/01/2019 | 2 | 19 | 1 (19-18) |
9 | 03/01/2019 | 3 | 22 | na (nothing 2 Jan) |
10 | 03/01/2019 | 1 | 22 | 2 (22-20) |
Edit: I'm sorry, I didn't mention that a city can have no data for a given date. I've inserted a row for city 3 on the first day (1 Jan) and another on 3 Jan. Because city 3 has nothing on 2 Jan, its row on 3 Jan should report NA.
CodePudding user response:
One way to achieve this is as follows:
import pandas as pd
data = {'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
'Date': {0: '01/01/2019', 1: '01/01/2019', 2: '01/01/2019',
3: '02/01/2019', 4: '02/01/2019', 5: '02/01/2019',
6: '02/01/2019', 7: '03/01/2019', 8: '03/01/2019'},
'City': {0: 1, 1: 1, 2: 2, 3: 1, 4: 2, 5: 2, 6: 3, 7: 1, 8: 1},
'Temp': {0: 15, 1: 15, 2: 18, 3: 20, 4: 19, 5: 19, 6: 25, 7: 22, 8: 22},
'Temp Diff 1 Day': {0: 'na', 1: 'na', 2: 'na', 3: '5 (20-15)',
4: '1 (19-18)', 5: '1 (19-18)', 6: 'na',
7: '2 (22-20)', 8: '2 (22-20)'}}
df = pd.DataFrame(data)
# mark duplicate (Date, City) rows, keeping the first occurrence unmarked
dupl_keep_first = df.duplicated(subset=['Date', 'City'], keep='first')
# mark every member of a duplicated (Date, City) group, including the first
dupl_keep_false = df.duplicated(subset=['Date', 'City'], keep=False)
# diff per city on all rows *except* the repeats (N.B.: ~ operator "flips" True/False)
df['Temp Diff 1 Day'] = df.loc[~dupl_keep_first].groupby('City')['Temp'].diff()
# now select *all* duplicates and ffill within each (Date, City) group, so the
# diff is copied onto the repeated rows without leaking across groups
df.loc[dupl_keep_false, 'Temp Diff 1 Day'] = (
    df.loc[dupl_keep_false]
    .groupby(['Date', 'City'])['Temp Diff 1 Day']
    .ffill()
)
print(df)
Index Date City Temp Temp Diff 1 Day
0 1 01/01/2019 1 15 NaN
1 2 01/01/2019 1 15 NaN
2 3 01/01/2019 2 18 NaN
3 4 02/01/2019 1 20 5.0
4 5 02/01/2019 2 19 1.0
5 6 02/01/2019 2 19 1.0
6 7 02/01/2019 3 25 NaN
7 8 03/01/2019 1 22 2.0
8 9 03/01/2019 1 22 2.0
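Note that `diff()` compares against the previous *available* date for that city, not the previous calendar day, so it would not report NA for city 3 on 3 Jan when 2 Jan is missing. One way to respect calendar gaps, and to generalize to the n-day columns the question asks for, is a self-merge on a shifted date. This is a sketch of my own, not the code above; the helper name `calendar_diff` and the parameter `n` are mine:

```python
import pandas as pd

def calendar_diff(df, n=1):
    """Temp change vs. the date exactly n days earlier; NaN if that date is missing."""
    d = df.copy()
    d['Date'] = pd.to_datetime(d['Date'], dayfirst=True)
    # keep one row per (City, Date) so the self-merge is unambiguous
    daily = d.drop_duplicates(subset=['City', 'Date'])[['City', 'Date', 'Temp']]
    # shift each daily reading forward n days so it lines up with the later date
    prev = daily.assign(Date=daily['Date'] + pd.Timedelta(days=n))
    merged = d.merge(prev, on=['City', 'Date'], how='left', suffixes=('', '_prev'))
    # rows with no reading n days earlier get NaN automatically from the left join
    return (merged['Temp'] - merged['Temp_prev']).to_numpy()

data = {'Date': ['01/01/2019'] * 3 + ['02/01/2019'] * 3 + ['03/01/2019'],
        'City': [1, 1, 2, 1, 2, 2, 1],
        'Temp': [15, 15, 18, 20, 19, 19, 22]}
df = pd.DataFrame(data)
df['Temp Diff 1 Day'] = calendar_diff(df, n=1)
df['Temp Diff 2 Day'] = calendar_diff(df, n=2)
print(df)
```

Because the join key is the exact shifted date, a city with no reading on the previous day naturally gets NaN, and any `n` gives an extra column.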
CodePudding user response:
# Group by city, then diff after dropping duplicates within the group.
# (N.B.: drop_duplicates here de-duplicates on the Temp *values*, so this
# assumes a city never repeats an earlier temperature on a later date.)
df['Temp_diff'] = df.groupby('City')['Temp'].transform(lambda x: x.drop_duplicates().diff())
# Then group by city and date, and ffill the diff onto the duplicated rows.
df['Temp_diff'] = df.groupby(['City', 'Date'])['Temp_diff'].ffill()
print(df)
Output:
   Index        Date  City  Temp  Temp_diff
0      1  01/01/2019     1    15        NaN
1      2  01/01/2019     1    15        NaN
2      3  01/01/2019     2    18        NaN
3      4  02/01/2019     1    20        5.0
4      5  02/01/2019     2    19        1.0
5      6  02/01/2019     2    19        1.0
6      7  02/01/2019     3    25        NaN
7      8  03/01/2019     1    22        2.0
8      9  03/01/2019     1    22        2.0
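If a city could record the same temperature on two different days, the value-based `drop_duplicates()` inside the transform would silently drop the later day. A variant of my own that de-duplicates on (City, Date) instead, then merges the diff back onto the repeated rows, avoids that edge case:

```python
import pandas as pd

# sample where 3 Jan repeats 1 Jan's temperature for the same city
df = pd.DataFrame({
    'Date': ['01/01/2019', '02/01/2019', '02/01/2019', '03/01/2019'],
    'City': [1, 1, 1, 1],
    'Temp': [15, 20, 20, 15],
})

# diff one row per (City, Date), so repeated *values* across days are kept
daily = df.drop_duplicates(subset=['City', 'Date']).copy()
daily['Temp_diff'] = daily.groupby('City')['Temp'].diff()
# broadcast the diff back to every duplicate row via a left join
df = df.merge(daily[['City', 'Date', 'Temp_diff']], on=['City', 'Date'], how='left')
print(df)
```

Here 3 Jan correctly gets -5 (15-20), which the value-based de-duplication would have lost.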