I have a dataframe with 100 columns filled with start dates. I’m trying to find the next date for each value compared against that row to produce another data frame of end dates. If there is no next date it will add 1 day.
Start dates :
| gas station | park | beach | store
| Car A | 1/1/2022 | 1/4/2021 | 1/2/2021 | 1/3/2021
| Car B | 2/14/2021 | 2/10/2021| 2/21/2021| 2/5/2021
Stop dates:
| gas station | park | beach | store
| Car A | 1/2/2022 | 1/5/2021 | 1/3/2021 | 1/4/2021
| Car B | 2/21/2021 | 2/14/2021| 2/22/2021| 2/10/2021
Explanation : The “start dates” is the current dataframe. Car A arrived to the column name locations on the dates shown. Same with car B. I want to create a new dataframe (“stop dates”) based on the start dates. Car A gas station start date compared against all other columns to find the next greatest date. That next greatest date will populate the “stop date” dataframe for car A gas station, etc
CodePudding user response:
You can write a custom function that takes in a row as an input, and returns the desired row as a pd.Series
, then apply this function to each row using df.apply
with the argument axis=1
. Also I believe you may have made a typo with your start dates and the first entry should be from the year 2021 as well. Otherwise, the next date after 1/4/2021
in the park
column in the same row would be 1/1/2022
.
For example:
import numpy as np
import pandas as pd
## recreate your start dates dataframe
df_start = pd.DataFrame(
columns=['gas station','park','beach','store'],
data = [
['1/1/2021','1/4/2021','1/2/2021','1/3/2021'],
['2/14/2021','2/10/2021','2/21/2021','2/5/2021']
],
index=['Car A', 'Car B']
)
for col in df_start.columns:
df_start[col] = pd.to_datetime(df_start[col])
## custom function that takes in a row as input
## and outputs a row as a series
def get_stop_dates(row):
sorted_dates = row.unique()
sorted_dates.sort()
new_row = []
for d in row.values:
idx = np.where(sorted_dates == d)[0][0]
if idx == len(sorted_dates) - 1:
new_date = pd.to_datetime(d) pd.Timedelta("1d")
else:
new_date = pd.to_datetime(sorted_dates[idx 1])
new_row.append(new_date)
return pd.Series(new_row)
df_stop = df_start.apply(lambda row: get_stop_dates(row), axis=1)
df_stop.columns = df_start.columns
Input:
>>> df_start
gas station park beach store
Car A 2021-01-01 2021-01-04 2021-01-02 2021-01-03
Car B 2021-02-14 2021-02-10 2021-02-21 2021-02-05
Output:
>>> df_stop
gas station park beach store
Car A 2021-01-02 2021-01-05 2021-01-03 2021-01-04
Car B 2021-02-21 2021-02-14 2021-02-22 2021-02-10