Let be the following DataFrame, where the time column is defined by a Python TimeDelta object:
ID | date | direction | time |
---|---|---|---|
0 | 2022-01-02 | IN | NaT |
0 | 2022-01-03 | OUT | 1 days |
0 | 2022-01-04 | IN | NaT |
0 | 2022-01-08 | OUT | 4 days |
1 | 2022-03-02 | IN | NaT |
1 | 2022-03-05 | OUT | 3 days |
2 | 2022-04-06 | IN | NaT |
2 | 2022-04-10 | OUT | 4 days |
I need to iterate over the existing DataFrame to create another one in the most efficient way containing information from rows two by two, i.e:
ID | start_date | end_date | time |
---|---|---|---|
0 | 2022-01-02 | 2022-01-03 | 1 days |
0 | 2022-01-04 | 2022-01-08 | 4 days |
1 | 2022-03-02 | 2022-03-05 | 3 days |
2 | 2022-04-06 | 2022-04-10 | 4 days |
Assume that the original DataFrame is contained in df
. I need a way to iterate over the rows two by two, being able to query in each iteration the information of the first row and its next one. I hope you can help me, thank you in advance.
CodePudding user response:
You can use df.loc
df.shift
to create final dataframe:
df_out = pd.DataFrame(
{
"ID": df.loc[::2, "ID"],
"start_date": df.loc[::2, "date"],
"end_date": df.shift(-1).loc[::2, "date"],
"time": df.shift(-1).loc[::2, "time"],
}
)
print(df_out)
Prints:
ID start_date end_date time
0 0 2022-01-02 2022-01-03 1 days
2 0 2022-01-04 2022-01-08 4 days
4 1 2022-03-02 2022-03-05 3 days
6 2 2022-04-06 2022-04-10 4 days
Or:
df_out = pd.DataFrame(
{
"ID": df.loc[::2, "ID"].values,
"start_date": df.loc[::2, "date"].values,
"end_date": df.loc[1::2, "date"].values,
"time": df.loc[1::2, "time"].values,
}
)