Combine rows with consecutive dates based on condition

Time:10-08

I would like to combine rows that have the same Id, consecutive dates, and the same feature values.

I have the following dataframe:

    Id      Start       End         Feature1  Feature2
0   A       2020-01-01  2020-01-15  1         1
1   A       2020-01-16  2020-01-30  1         1
2   A       2020-01-31  2020-02-15  0         1
3   A       2020-07-01  2020-07-15  0         1
4   B       2020-01-31  2020-02-15  0         0
5   B       2020-02-16  NaT         0         0
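For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a minimal sketch that assumes Start and End are meant to be datetime columns, with the missing End stored as NaT:

```python
import pandas as pd

# Sample frame from the question; the missing End in the last row becomes NaT.
df = pd.DataFrame({
    "Id": ["A", "A", "A", "A", "B", "B"],
    "Start": pd.to_datetime(["2020-01-01", "2020-01-16", "2020-01-31",
                             "2020-07-01", "2020-01-31", "2020-02-16"]),
    "End": pd.to_datetime(["2020-01-15", "2020-01-30", "2020-02-15",
                           "2020-07-15", "2020-02-15", None]),
    "Feature1": [1, 1, 0, 0, 0, 0],
    "Feature2": [1, 1, 1, 1, 0, 0],
})
print(df)
```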

And the expected result is:

    Id      Start       End         Feature1  Feature2
0   A       2020-01-01  2020-01-30  1         1
1   A       2020-01-31  2020-02-15  0         1
2   A       2020-07-01  2020-07-15  0         1
3   B       2020-01-31  NaT         0         0

I have tried the answers from other posts, but they don't really match my use case.

Thanks in advance!

CodePudding user response:

Extract the month from both date columns:

import pandas as pd

df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month

Now group the data frame by ['Id','Feature1','Feature2','sMonth','eMonth'], take the minimum Start and the maximum End of each group, and drop the helper columns:

df.groupby(['Id','Feature1','Feature2','sMonth','eMonth']).agg({'Start':'min','End':'max'}).reset_index().drop(['sMonth','eMonth'],axis=1)

Result

  Id  Feature1  Feature2       Start         End
0  A         0         1  2020-01-31  2020-02-15
1  A         0         1  2020-07-01  2020-07-15
2  A         1         1  2020-01-01  2020-01-30
3  B         0         0  2020-01-31  2020-02-15
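As a quick check, the month-based grouping can be run end to end on the question's sample data. This is only a sketch under that assumption. One thing it makes visible: with the default dropna=True, groupby leaves out rows whose key is missing, so the row with End = NaT (and therefore a missing eMonth) is excluded and B's End comes back as 2020-02-15 rather than NaT.

```python
import pandas as pd

# Sample frame from the question (End of the last row is missing).
df = pd.DataFrame({
    "Id": ["A", "A", "A", "A", "B", "B"],
    "Start": pd.to_datetime(["2020-01-01", "2020-01-16", "2020-01-31",
                             "2020-07-01", "2020-01-31", "2020-02-16"]),
    "End": pd.to_datetime(["2020-01-15", "2020-01-30", "2020-02-15",
                           "2020-07-15", "2020-02-15", None]),
    "Feature1": [1, 1, 0, 0, 0, 0],
    "Feature2": [1, 1, 1, 1, 0, 0],
})

by_month = (
    df.assign(sMonth=df["Start"].dt.month, eMonth=df["End"].dt.month)
      # Rows with a missing key (eMonth is NaN for the NaT row) are dropped by default.
      .groupby(["Id", "Feature1", "Feature2", "sMonth", "eMonth"])
      .agg({"Start": "min", "End": "max"})
      .reset_index()
      .drop(["sMonth", "eMonth"], axis=1)
)
print(by_month)
```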

CodePudding user response:

You can approach it as follows:

  1. Get the day difference between consecutive entries within the same group (same Id, Feature1 and Feature2) by subtracting the previous entry's End from the current Start, using GroupBy.shift().

  2. Set a group number group_no such that a new group number is issued for the first entry of each group and whenever the day difference from the previous entry within the group is greater than 1.

  3. Then, group by Id and group_no and aggregate the Start and End dates for each group using .groupby() and .agg().
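To see what steps 1 and 2 produce, here is a small sketch on the question's sample data that prints the intermediate values. The names prev_end and steps are only illustrative.

```python
import pandas as pd

# Sample frame from the question (End of the last row is missing).
df = pd.DataFrame({
    "Id": ["A", "A", "A", "A", "B", "B"],
    "Start": pd.to_datetime(["2020-01-01", "2020-01-16", "2020-01-31",
                             "2020-07-01", "2020-01-31", "2020-02-16"]),
    "End": pd.to_datetime(["2020-01-15", "2020-01-30", "2020-02-15",
                           "2020-07-15", "2020-02-15", None]),
    "Feature1": [1, 1, 0, 0, 0, 0],
    "Feature2": [1, 1, 1, 1, 0, 0],
}).sort_values(["Id", "Start"])

# Step 1: End of the previous row within the same (Id, Feature1, Feature2) group.
prev_end = df.groupby(["Id", "Feature1", "Feature2"])["End"].shift()
day_diff = (df["Start"] - prev_end).dt.days  # NaN on the first row of each group

# Step 2: start a new group on a group's first row or after a gap of more than 1 day.
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

steps = pd.DataFrame({"Id": df["Id"], "day_diff": day_diff, "group_no": group_no})
print(steps)
```

Rows 0 and 1 share a group number because the second Start is exactly one day after the first End; the July row gets its own number because the gap is far larger than one day.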

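As End contains NaT, we take the last entry of End within each group with x.iloc[-1] instead of 'last': 'last' skips missing values and would return the last non-missing date rather than NaT. We also pass dropna=False when grouping, so that no rows are dropped if a grouping key happens to be missing.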
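The difference between the two ways of taking the last End can be seen on a two-row series. This is a minimal sketch with made-up variable names:

```python
import pandas as pd

end = pd.Series(pd.to_datetime(["2020-02-15", None]))
key = pd.Series(["B", "B"])

# GroupBy.last() returns the last non-missing value in each group.
skips_nat = end.groupby(key).last()

# x.iloc[-1] returns the value in the last row, even when it is NaT.
keeps_nat = end.groupby(key).agg(lambda x: x.iloc[-1])

print(skips_nat["B"])
print(keeps_nat["B"])
```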

import pandas as pd

# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# sort by columns `Id` and `Start` if not already in this order
df = df.sort_values(['Id', 'Start'])

day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days

group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                }))

Result:

print(df_out)

  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-30         1         1
1  A 2020-01-31 2020-02-15         0         1
2  A 2020-07-01 2020-07-15         0         1
3  B 2020-01-31        NaT         0         0
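If you need this in more than one place, the same idea can be wrapped in a function. The sketch below is one possible variant and not the code above verbatim. merge_consecutive and the run column are hypothetical names. It groups by the key columns plus the run number and uses named aggregation, so the key columns are never aggregated themselves. It assumes Start and End are already datetime and that the key columns have no missing values.

```python
import pandas as pd

def merge_consecutive(df, keys=("Id", "Feature1", "Feature2")):
    """Merge rows sharing `keys` whose Start is at most 1 day after the previous End."""
    keys = list(keys)
    df = df.sort_values(["Id", "Start"])
    # Day gap to the previous row of the same key group (NaN on a group's first row).
    gap = (df["Start"] - df.groupby(keys)["End"].shift()).dt.days
    # A new run starts on a group's first row or after a gap of more than 1 day.
    run = (gap.isna() | gap.gt(1)).cumsum().rename("run")
    out = (
        df.groupby(keys + [run], sort=False)  # sort=False keeps first-appearance order
          .agg(Start=("Start", "first"), End=("End", lambda x: x.iloc[-1]))
          .reset_index()
    )
    return out[["Id", "Start", "End", "Feature1", "Feature2"]]

# Sample frame from the question (End of the last row is missing).
df = pd.DataFrame({
    "Id": ["A", "A", "A", "A", "B", "B"],
    "Start": pd.to_datetime(["2020-01-01", "2020-01-16", "2020-01-31",
                             "2020-07-01", "2020-01-31", "2020-02-16"]),
    "End": pd.to_datetime(["2020-01-15", "2020-01-30", "2020-02-15",
                           "2020-07-15", "2020-02-15", None]),
    "Feature1": [1, 1, 0, 0, 0, 0],
    "Feature2": [1, 1, 1, 1, 0, 0],
})
df_out = merge_consecutive(df)
print(df_out)
```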