I have the following DataFrame in pandas:
df = pd.DataFrame({
"Name": [ "N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2" ],
"Date": [ "31-10-2021", "31-10-2021" , "31-10-2021", "15-10-2021", "14-10-2021", "13-10-2021", "12-10-2021", "11-10-2021" ],
"Feature": [ 4, 5, 6, 3, 1, 6, 3, 3 ]
})
  Name        Date  Feature
0   N1  31-10-2021        4
1   N2  31-10-2021        5
2   N3  31-10-2021        6
3   N1  15-10-2021        3
4   N1  14-10-2021        1
5   N2  13-10-2021        6
6   N3  12-10-2021        3
7   N2  11-10-2021        3
I want to create a new column holding the difference between the current Feature value for a given name and the Feature value from the previous time that name appeared (its next-earlier date), or zero if the name has no earlier occurrence.
So, given the above table, the result should be:
  Name        Date  Feature  New_column
0   N1  31-10-2021        4           1
1   N2  31-10-2021        5          -1
2   N3  31-10-2021        6           3
3   N1  15-10-2021        3           2
4   N1  14-10-2021        1           0
5   N2  13-10-2021        6           3
6   N3  12-10-2021        3           0
7   N2  11-10-2021        3           0
Is there a vectorized, efficient way to do this? Thanks in advance.
CodePudding user response:
You can use shift with groupby:
import pandas as pd
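df = pd.DataFrame({
"Name": [ "N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2" ],
"Date": [ "31-10-2021", "31-10-2021" , "31-10-2021", "15-10-2021", "14-10-2021", "13-10-2021", "12-10-2021", "11-10-2021" ],
"Feature": [ 4, 5, 6, 3, 1, 6, 3, 3 ]
})
# Parse the day-first strings so that sorting is chronological, not alphabetical
# (note: this turns the Date column into datetime64 values)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
# Order each name's rows from oldest to newest
df.sort_values(by=['Name', 'Date'], inplace=True)
# Current value minus the previous (older) value of the same name
df['New_column'] = df['Feature'] - df.groupby('Name')['Feature'].shift()
# The oldest row of each name has no predecessor; turn its NaN into 0
df['New_column'] = df['New_column'].fillna(0)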
The last line is needed because the first (oldest) row of each Name has no previous value and therefore comes out as NaN, whereas your example shows a 0 there.
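To see where that NaN comes from, here is a minimal sketch on a made-up three-row frame (the name `demo` and its values are only for illustration, and the rows are assumed to be ordered oldest to newest within each name):

```python
import pandas as pd

# Hypothetical frame, already ordered oldest to newest within each name
demo = pd.DataFrame({
    "Name": ["N1", "N1", "N2"],
    "Feature": [1, 3, 6],
})

# shift() moves each value down one row inside its own group
previous = demo.groupby("Name")["Feature"].shift()

# Rows 0 and 2 open their groups, so they have no predecessor and become NaN;
# fillna(0) turns those gaps into the zeros asked for in the question
diff = (demo["Feature"] - previous).fillna(0)
print(diff.tolist())
```

Only the second N1 row has something to subtract, so it is the only row that ends up non-zero.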
CodePudding user response:
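We can do:
# Sort chronologically (oldest first) so diff() subtracts the previous occurrence;
# assign() aligns on the index, so the original row order is kept
result_df = df.assign(New_column=df.sort_values('Date', key=lambda s: pd.to_datetime(s, format='%d-%m-%Y'))
                      .groupby('Name')['Feature'].diff().fillna(0))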
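One thing to watch with either approach: the dates are day-first strings, so sorting them as plain text only works while all rows fall in the same month. The sketch below is a rough illustration on made-up data; the helper name `add_feature_diff` and the `sample` frame are assumptions, not part of the question.

```python
import pandas as pd

def add_feature_diff(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with New_column = Feature minus the same name's previous Feature."""
    # Parse the day-first strings so the ordering is chronological
    dates = pd.to_datetime(frame["Date"], format="%d-%m-%Y")
    ordered = frame.assign(_date=dates).sort_values("_date", kind="stable")
    # Difference to the previous (older) row of the same name; first rows become 0
    diff = ordered.groupby("Name")["Feature"].diff().fillna(0)
    # assign() aligns on the index, so the caller's row order is preserved
    return frame.assign(New_column=diff)

# Hypothetical data that crosses a month boundary
sample = pd.DataFrame({
    "Name": ["N1", "N1", "N1"],
    "Date": ["01-11-2021", "31-10-2021", "15-10-2021"],
    "Feature": [7, 4, 3],
})

result = add_feature_diff(sample)

# Sorting the raw strings puts "01-11-2021" first, which is not chronological
naive = sample.assign(
    New_column=sample.sort_values("Date").groupby("Name")["Feature"].diff().fillna(0)
)
print(result["New_column"].tolist(), naive["New_column"].tolist())
```

With parsed dates the newest row is compared against 31-10-2021, while the text sort treats 01-11-2021 as the oldest row and the differences come out wrong.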