I have a DataFrame, which you can reproduce by running:
import pandas as pd
from io import StringIO
df = """
case_id  first_created  last_paid  submitted_time
3456  2021-01-27  2021-01-29  2021-01-26 21:34:36.566023+00:00
7891  2021-08-02  2021-09-16  2022-10-26 19:49:14.135585+00:00
1245  2021-09-13  None  2022-10-31 02:03:59.620348+00:00
9073  None  None  2021-09-12 10:25:30.845687+00:00
"""
# Columns are separated by two or more spaces, so the single space inside submitted_time is kept.
df = pd.read_csv(StringIO(df.strip()), sep=r'\s\s+', engine='python')
df
The logic is to create two new columns for each row:
df['create_duration']=df['submitted_time']-df['first_created']
df['paid_duration']=df['submitted_time']-df['last_paid']
The unit needs to be days.
My challenge is that sometimes last_paid or first_created will be None. How can I skip the None value in a row but still compute the other column if its value is not None?
For example, last_paid in the third row is None but first_created is not, so for this row:
df['create_duration']=df['submitted_time']-df['first_created']
df['paid_duration']='N/A'
CodePudding user response:
You can use:
submitted = pd.to_datetime(df['submitted_time'], errors='coerce', utc=True).dt.tz_localize(None)
df['create_duration'] = submitted.sub(pd.to_datetime(df['first_created'], errors='coerce')).dt.days
df['paid_duration'] = submitted.sub(pd.to_datetime(df['last_paid'], errors='coerce')).dt.days
Output:
   case_id first_created   last_paid                    submitted_time  create_duration  paid_duration
0     3456    2021-01-27  2021-01-29  2021-01-26 21:34:36.566023+00:00             -1.0           -3.0
1     7891    2021-08-02  2021-09-16  2022-10-26 19:49:14.135585+00:00            450.0          405.0
2     1245    2021-09-13        None  2022-10-31 02:03:59.620348+00:00            413.0            NaN
3     9073          None        None  2021-09-12 10:25:30.845687+00:00              NaN            NaN
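If the float display (-1.0, NaN) is a nuisance, here is a self-contained sketch of the same approach that stores the durations as nullable integers, so missing rows show as <NA>. It builds the sample data directly instead of parsing it from text. The helper name days_until_submitted and the variable create_days_exact are my own illustrative names, not part of pandas. It also shows that .dt.days floors toward negative infinity, which is why the first row gives -1 even though the gap is only about two and a half hours.

```python
import pandas as pd

# Same sample data as in the question, built directly for reproducibility.
df = pd.DataFrame({
    "case_id": [3456, 7891, 1245, 9073],
    "first_created": ["2021-01-27", "2021-08-02", "2021-09-13", None],
    "last_paid": ["2021-01-29", "2021-09-16", None, None],
    "submitted_time": [
        "2021-01-26 21:34:36.566023+00:00",
        "2022-10-26 19:49:14.135585+00:00",
        "2022-10-31 02:03:59.620348+00:00",
        "2021-09-12 10:25:30.845687+00:00",
    ],
})

# Parse as UTC, then drop the timezone so it can be subtracted from naive dates.
submitted = pd.to_datetime(
    df["submitted_time"], errors="coerce", utc=True
).dt.tz_localize(None)


def days_until_submitted(date_col):
    """Hypothetical helper: whole days from df[date_col] to submitted_time."""
    # Unparseable or missing dates become NaT, so the difference is NaT too.
    delta = submitted - pd.to_datetime(df[date_col], errors="coerce")
    # .dt.days yields floats with NaN; Int64 keeps integers and shows <NA>.
    return delta.dt.days.astype("Int64")


df["create_duration"] = days_until_submitted("first_created")
df["paid_duration"] = days_until_submitted("last_paid")

# Fractional days, for comparison with the floored .dt.days values.
create_days_exact = (
    submitted - pd.to_datetime(df["first_created"], errors="coerce")
).dt.total_seconds() / 86400

print(df[["case_id", "create_duration", "paid_duration"]])
```

Missing values propagate per column, so a row with a missing last_paid still gets its create_duration. If you really need the literal string 'N/A', you could call .astype(object).fillna('N/A') at the end, but the column will then no longer be numeric.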