I would like to leave date formatted items in "Dirty Dates" and merge any non-date values as str with existing data in "Comments & Junk" along with " | " to separate values if both have value. Please no lectures on data source etc., these are excel files emailed to us from many suppliers and they don't all follow the rules for date columns.
here is the raw df:
df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})
I have tried adding a 3rd temporary column "merge" to conditionally combine the two columns and add some formatting (a pipe separator with space before / after) in hopes I could use it to get my result and then drop it later, this column is not needed in the final result other than for processing so feel free to avoid or drop as needed.
Here is the code for what I have tried:
df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})
df_t1 = df_raw.copy(deep=True)
df_t1['Dirty Dates'] = pd.to_datetime(df_t1['Dirty Dates'], errors='coerce')
df_t1['Dirty Dates'] = df_t1['Dirty Dates'].apply(lambda x: x if isinstance(x,datetime.datetime) else np.nan)
if df_t1['Comments & Junk'].isnull:
df_t1['Merge'] = df_t1['Dirty Dates'].astype(str)
else:
df_t1['Merge'] = df_t1['Dirty Dates'].astype(str) ' | ' df_t1['Comments & Junk']
print(df_raw)
print(df_t1)
Here is what the desired final output should look like:
CodePudding user response:
I actually modified your df_raw a little to add two edge cases since this may also happen in your data (see second code block).
This is what i would do:
def handle(dirty_dates, comm_junk):
dates = pd.to_datetime(dirty_dates, errors="coerce")
isna = dates.isna() # errors= "coerce" will make all nondates NaT
nondates = dirty_dates.where(isna, np.nan)
# then you want | in between any two strings but it could be an NaN or an empty string
toadd = comm_junk.ne("", fill_value="")
comm_junk = comm_junk.where(~(isna & toadd), comm_junk " | " nondates)
# but you don't want a pointless | so handle it seperately
comm_junk.loc[isna & ~toadd] = nondates
return dates, comm_junk
From the console:
>> df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk', 'other'],'Comments & Junk':['','stuff',np.nan,'things',np.nan, ""]})
>> df_raw
Dirty Dates Comments & Junk
0 1/21/22
1 3-1-22 stuff
2 22-4-7 NaN # combination of date and NaN
3 junk things
4 more junk NaN
5 other # combination of nondate and ""
>> dates, comm_junk = handle(df_raw["Dirty Dates"], df_raw["Comments & Junk"])
>> dates
0 2022-01-21
1 2022-03-01
2 2007-04-22
3 NaT
4 NaT
5 NaT
Name: Dirty Dates, dtype: datetime64[ns]
>> comm_junk
0
1 stuff
2 NaN
3 things | junk
4 more junk
5 other
Name: Comments & Junk, dtype: object
Let me know in a comment if you encounter an issue.