Creating new columns in Pandas based on difference of different row values

Time:11-02

I have the following dataframe in pandas:

df = pd.DataFrame({
    "Name": [ "N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2" ],
    "Date": [ "31-10-2021", "31-10-2021" , "31-10-2021", "15-10-2021", "14-10-2021", "13-10-2021", "12-10-2021", "11-10-2021" ],
    "Feature": [ 4, 5, 6, 3, 1, 6, 3, 3 ]
})
          Name Date       Feature
0         N1   31-10-2021 4
1         N2   31-10-2021 5
2         N3   31-10-2021 6
3         N1   15-10-2021 3
4         N1   14-10-2021 1
5         N2   13-10-2021 6
6         N3   12-10-2021 3
7         N2   11-10-2021 3

I want to create a new column holding the difference between the current Feature value for a given Name and the Feature value from the previous time that Name appeared (by date), or zero if the Name has no earlier occurrence.

So, given the above table, the result should be:

          Name Date       Feature New_column
0         N1   31-10-2021 4       1
1         N2   31-10-2021 5       -1
2         N3   31-10-2021 6       3
3         N1   15-10-2021 3       2
4         N1   14-10-2021 1       0
5         N2   13-10-2021 6       3
6         N3   12-10-2021 3       0
7         N2   11-10-2021 3       0

Is there a vectorized/efficient way to do this? Thanks in advance.

CodePudding user response:

You can use groupby with shift:

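import pandas as pd

df = pd.DataFrame({
    "Name": [ "N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2" ],
    "Date": [ "31-10-2021", "31-10-2021" , "31-10-2021", "15-10-2021", "14-10-2021", "13-10-2021", "12-10-2021", "11-10-2021" ],
    "Feature": [ 4, 5, 6, 3, 1, 6, 3, 3 ]
})

# Parse the dd-mm-yyyy strings so the sort is chronological, not lexicographic
df['Date'] = pd.to_datetime(df['Date'], format = '%d-%m-%Y')

# Oldest first within each Name, so shift() returns the previous occurrence
df.sort_values(by = ['Name', 'Date'], inplace = True)

df['New_column'] = df['Feature'] - df.groupby('Name')['Feature'].shift()

# Restore the original row order
df.sort_index(inplace = True)

# The earliest row of each Name has no previous value (NaN); use 0 instead
df['New_column'] = df['New_column'].fillna(0)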

The final NaN-to-0 step is needed because the earliest row of each Name has no previous value, so shift() leaves a NaN there, whereas your example shows a 0 for those rows.
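
Note that the Date values are dd-mm-yyyy strings. Sorting them as text compares the day digits first, which only matches chronological order when all dates fall in the same month, as they happen to here. A minimal sketch of the difference, using made-up dates that span two months (they are not from the question):

```python
import pandas as pd

# Hypothetical dates spanning two months (not from the question's data)
dates = pd.Series(["02-11-2021", "31-10-2021", "15-10-2021"])

# Text sort compares the day digits first, so 02-11 lands before 15-10
text_order = dates.sort_values().tolist()

# Parsing with an explicit format gives true chronological order
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
chrono_order = dates.loc[parsed.sort_values().index].tolist()

print(text_order)
print(chrono_order)
```

Parsing with pd.to_datetime before sorting avoids the problem regardless of how many months the data covers.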

CodePudding user response:

Alternatively, groupby with diff does it in one expression and keeps the original row order:

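# Sort oldest-first on parsed dates so diff() subtracts each Name's previous value;
# assign() then aligns the result back to the original rows by index.
by_date = df.sort_values('Date', key=lambda s: pd.to_datetime(s, format='%d-%m-%Y'))
result_df = df.assign(New_column=by_date.groupby('Name')['Feature'].diff().fillna(0))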
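
To check a solution against the expected table in the question, here is a small self-contained sketch. The names `expected`, `ordered`, `prev` and the helper column `_d` are mine, not from the original answers, and the approach shown is just one way to get the previous value per Name:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2"],
    "Date": ["31-10-2021", "31-10-2021", "31-10-2021", "15-10-2021",
             "14-10-2021", "13-10-2021", "12-10-2021", "11-10-2021"],
    "Feature": [4, 5, 6, 3, 1, 6, 3, 3],
})

# Expected New_column values copied from the question, in original row order
expected = [1, -1, 3, 2, 0, 3, 0, 0]

# Sort oldest-first on parsed dates (helper column _d is hypothetical),
# then take each Name's previous Feature value
ordered = df.assign(_d=pd.to_datetime(df["Date"], format="%d-%m-%Y")).sort_values("_d")
prev = ordered.groupby("Name")["Feature"].shift()

# Index alignment maps the shifted values back onto the original rows
df["New_column"] = (df["Feature"] - prev).fillna(0).astype(int)

print(df)
```

Because pandas aligns on the index when subtracting, the result stays in the original row order without an explicit re-sort.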