Home > Net >  Difference with nearest conditional rows per group using Pandas
Difference with nearest conditional rows per group using Pandas

Time:10-01

I have a dataframe (sample) like this:

import pandas as pd

data = [['A', '2022-09-01', False, 2], ['A', '2022-09-02', False, 3], ['A', '2022-09-03', True, 1],
        ['A', '2022-09-05', False, 4], ['A', '2022-09-08', True, 4], ['A', '2022-09-09', False, 2],
        ['B', '2022-09-03', False, 4], ['B', '2022-09-05', True, 5], ['B', '2022-09-06', False, 7],
        ['B', '2022-09-09', True, 4], ['B', '2022-09-10', False, 2], ['B', '2022-09-11', False, 3]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'val'])

   group        date  indicator  val
0      A  2022-09-01      False    2
1      A  2022-09-02      False    3
2      A  2022-09-03       True    1
3      A  2022-09-05      False    4
4      A  2022-09-08       True    4
5      A  2022-09-09      False    2
6      B  2022-09-03      False    4
7      B  2022-09-05       True    5
8      B  2022-09-06      False    7
9      B  2022-09-09       True    4
10     B  2022-09-10      False    2
11     B  2022-09-11      False    3

I would like to create a column called Diff, which shows the difference of rows with its nearest (depends on date) conditional rows (indicator == True) where the conditional rows have a value of 0 per group. Here is the desired output:

data = [['A', '2022-09-01', False, 2, 1], ['A', '2022-09-02', False, 3, 2], ['A', '2022-09-03', True, 1, 0],
        ['A', '2022-09-05', False, 4, 3], ['A', '2022-09-08', True, 4, 0], ['A', '2022-09-09', False, 2, -2],
        ['B', '2022-09-03', False, 4, -1], ['B', '2022-09-05', True, 5, 0], ['B', '2022-09-06', False, 7, 2],
        ['B', '2022-09-09', True, 4, 0], ['B', '2022-09-10', False, 2, -2], ['B', '2022-09-11', False, 3, -1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'val', 'Diff'])

   group        date  indicator  val  Diff
0      A  2022-09-01      False    2     1
1      A  2022-09-02      False    3     2
2      A  2022-09-03       True    1     0
3      A  2022-09-05      False    4     3
4      A  2022-09-08       True    4     0
5      A  2022-09-09      False    2    -2
6      B  2022-09-03      False    4    -1
7      B  2022-09-05       True    5     0
8      B  2022-09-06      False    7     2
9      B  2022-09-09       True    4     0
10     B  2022-09-10      False    2    -2
11     B  2022-09-11      False    3    -1

As you can see it returns the difference respectively with the nearest indicator == True rows per group where the conditioned rows have a Diff of 0. So I was wondering if anyone knows have to get the desired result using pandas?

Extra info column Diff:

Let's take group A as an example. The column Diff is calculated by the difference with respect to the nearest row with indicator True. So for example:

  • Row 1 is 2 - 1 = 1 with respect to row 3 (nearest True row based on date).
  • Row 2 is 3 - 1 = 2 with respect to row 3.
  • Row 4 is 4 - 1 = 3 with respect to row 3.
  • Row 6 is 2 - 4 = -2 with respect to row 5 (nearest True row based on date).
  • The rows with True have a value of 0 in Diff because everything is calculated with respect to these rows.

CodePudding user response:

IIUC use merge_asof with filtered rows by indicator and subtract column val:

df['date'] = pd.to_datetime(df['date'] )

df = df.sort_values('date')

df['Diff'] = df['val'].sub(pd.merge_asof(df, 
              df[df['indicator']], on='date', by='group', direction='nearest')['val_y'])
df = df.sort_index()
  • Related