Home > OS >  Merging two series with alternating dates into one grouped Pandas dataframe
Merging two series with alternating dates into one grouped Pandas dataframe

Time:06-22

Given are two series, like this:

#period1
DATE
2020-06-22    310.62
2020-06-26    300.05
2020-09-23    322.64
2020-10-30    326.54

#period2
DATE
2020-06-23    312.05
2020-09-02    357.70
2020-10-12    352.43
2021-01-25    384.39

These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.

I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.

In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.

Explicit output:

DATE          CHG       PERIOD
2020-06-22    NaN       1
2020-06-23    1.43      2
2020-06-26    12.0      1
2020-09-02    57.65     2
2020-09-23    35.06     1       
2020-10-12    29.79     2
2020-10-30    25.89     1
2021-01-25    57.85     2

However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable

Thank you!

CodePudding user response:

p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})

p1['period'] = 1
p2['period'] = 2
df = p1.append(p2).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)

Output:

    Date     period CHG
0   2020-06-22  1   NaN
1   2020-06-23  2   1.43
2   2020-06-26  1   12.00
3   2020-09-02  2   57.65
4   2020-09-23  1   35.06
5   2020-10-12  2   29.79
6   2020-10-30  1   25.89
7   2021-01-25  2   57.85

EDIT: matching the format START - STOP - CHANGE - PERIOD

Starting from the above data frame:

df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df1[['Start', 'Stop', 'CHG', 'period']]
df

Output:

    Start       Stop        CHG     period
0   NaN         2020-06-22  NaN     1
1   2020-06-22  2020-06-23  1.43    2
2   2020-06-23  2020-06-26  12.00   1
3   2020-06-26  2020-09-02  57.65   2
4   2020-09-02  2020-09-23  35.06   1
5   2020-09-23  2020-10-12  29.79   2
6   2020-10-12  2020-10-30  25.89   1
7   2020-10-30  2021-01-25  57.85   2

CodePudding user response:

# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)

df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)

Output:

              CNG  PERIOD
Date
2020-06-22    NaN       1
2020-06-23   1.43       2
2020-06-26  12.00       1
2020-09-02  57.65       2
2020-09-23  35.06       1
2020-10-12  29.79       2
2020-10-30  25.89       1
2021-01-25  57.85       2
2021-01-29  14.32       1
2021-02-12  22.57       2
2021-03-04  15.94       1
2021-05-07  45.42       2
2021-05-12  16.71       1
2021-09-02  47.78       2
2021-10-04  24.55       1
2021-11-18  41.09       2
2021-12-01  19.23       1
2021-12-10  20.24       2
2021-12-20  15.76       1
2022-01-03  22.73       2
2022-01-27  46.47       1
2022-02-09  26.30       2
2022-02-23  35.59       1
2022-03-02  15.94       2
2022-03-08  21.64       1
2022-03-29  45.30       2
2022-04-29  49.55       1
2022-05-04  17.06       2
2022-05-12  36.72       1
2022-05-17  15.98       2
2022-05-19  18.86       1
2022-06-02  27.93       2
2022-06-17  51.53       1
  • Related