Home > OS >  Creating Start & End Date features out from one date column
Creating Start & End Date features out from one date column

Time:11-03

I'm trying to make a new feature out from one date column to provide start and end date.

Here's what it looks like:

unique_id = ["001", "001", "001",
             "002",
             "003", "003"
            ]
end_dates = ["2018-10-31 12:43:03 PM", "2018-10-31 12:44:23 PM", "2018-10-31 1:01:42 PM",
             "2018-11-23 03:33:13 PM",
             "2018-11-23 04:10:45 PM", "2018-11-23 04:13:58 PM"
             
            ]
activity_class = ["step 1", "step 2", "step 3",
                  "step 1",
                  "step 1", "step 2"
                 ]

df = \
pd.DataFrame({"ID": unique_id,
              "Edit Date": end_dates,
              "Activity": activity_class
             })

df["Edit Date"] = pd.to_datetime(df["Edit Date"])

Here's how I want it to look like:

unique_id = ["001", "001", "001",
             "002",
             "003", "003"
            ]


start_date = ["2018-10-31 12:43:03 PM", "2018-10-31 12:43:03 PM", "2018-10-31 12:44:23 PM",
              "2018-11-23 03:33:13 PM",
              "2018-11-23 04:10:45 PM", "2018-11-23 04:10:45 PM"
             ]

end_date = ["2018-10-31 12:43:03 PM", "2018-10-31 12:44:23 PM", "2018-10-31 1:01:42 PM",
             "2018-11-23 03:33:13 PM",
             "2018-11-23 04:10:45 PM", "2018-11-23 04:13:58 PM"
           ]

activity_class = ["step 1", "step 2", "step 3",
                  "step 1",
                  "step 1", "step 2"
                 ]

df = \
pd.DataFrame({"ID": unique_id,
              "Start_Date": start_date,
              "End_Date": end_date,
              "Activity": activity_class
             })

df["Start_Date"] = pd.to_datetime(df["Start_Date"])
df["End_Date"] = pd.to_datetime(df["End_Date"])

What I tried so far:

df["Start_Date"] = df["Edit Date"].shift(1).backfill()

Some of the rules:

  • Data is sorted ascending by unique id and date
  • regardless of the label in "activity" as long as it's the first one, it the date should be the same for start and end
  • next activity's start date should copy the previous activity's end date

CodePudding user response:

You are looking for groupby().shift()?

df['Start_Date'] = df['Edit Date']
df['End_Date'] = df.groupby('ID')['Edit Date'].shift().fillna(df['Edit Date'])

Output:

    ID           Edit Date Activity          Start_Date            End_Date
0  001 2018-10-31 12:43:03   step 1 2018-10-31 12:43:03 2018-10-31 12:43:03
1  001 2018-10-31 12:44:23   step 2 2018-10-31 12:44:23 2018-10-31 12:43:03
2  001 2018-10-31 13:01:42   step 3 2018-10-31 13:01:42 2018-10-31 12:44:23
3  002 2018-11-23 15:33:13   step 1 2018-11-23 15:33:13 2018-11-23 15:33:13
4  003 2018-11-23 16:10:45   step 1 2018-11-23 16:10:45 2018-11-23 16:10:45
5  003 2018-11-23 16:13:58   step 2 2018-11-23 16:13:58 2018-11-23 16:10:45
  • Related