Home > Net >  How can I leave dates and merge non-dates into another column of df?
How can I leave dates and merge non-dates into another column of df?

Time:03-17

I would like to leave date formatted items in "Dirty Dates" and merge any non-date values as str with existing data in "Comments & Junk" along with " | " to separate values if both have value. Please no lectures on data source etc., these are excel files emailed to us from many suppliers and they don't all follow the rules for date columns.

here is the raw df:
df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})

I have tried adding a 3rd temporary column "merge" to conditionally combine the two columns and add some formatting (a pipe separator with space before / after) in hopes I could use it to get my result and then drop it later, this column is not needed in the final result other than for processing so feel free to avoid or drop as needed.

Here is the code for what I have tried:

df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})
df_t1 = df_raw.copy(deep=True)
df_t1['Dirty Dates'] = pd.to_datetime(df_t1['Dirty Dates'], errors='coerce')
df_t1['Dirty Dates'] = df_t1['Dirty Dates'].apply(lambda x: x if isinstance(x,datetime.datetime) else np.nan)
if df_t1['Comments & Junk'].isnull:
    df_t1['Merge'] = df_t1['Dirty Dates'].astype(str)
else:
    df_t1['Merge'] = df_t1['Dirty Dates'].astype(str)   ' | '   df_t1['Comments & Junk']

print(df_raw)
print(df_t1)

Here is what the desired final output should look like: enter image description here

CodePudding user response:

I actually modified your df_raw a little to add two edge cases since this may also happen in your data (see second code block).

This is what i would do:

def handle(dirty_dates, comm_junk):
    dates = pd.to_datetime(dirty_dates, errors="coerce")
    isna = dates.isna()  # errors= "coerce" will make all nondates NaT
    nondates = dirty_dates.where(isna, np.nan)

    # then you want | in between any two strings but it could be an NaN or an empty string
    toadd = comm_junk.ne("", fill_value="")
    comm_junk = comm_junk.where(~(isna & toadd), comm_junk   " | "   nondates)
    # but you don't want a pointless | so handle it seperately
    comm_junk.loc[isna & ~toadd] = nondates

    return dates, comm_junk

From the console:

>> df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk', 'other'],'Comments & Junk':['','stuff',np.nan,'things',np.nan, ""]})

>> df_raw
  Dirty Dates Comments & Junk
0     1/21/22                
1      3-1-22           stuff
2      22-4-7             NaN   # combination of date and NaN
3        junk          things
4   more junk             NaN
5       other                   # combination of nondate and ""

>> dates, comm_junk = handle(df_raw["Dirty Dates"], df_raw["Comments & Junk"])

>> dates
0   2022-01-21
1   2022-03-01
2   2007-04-22
3          NaT
4          NaT
5          NaT
Name: Dirty Dates, dtype: datetime64[ns]

>> comm_junk
0                 
1            stuff
2              NaN
3    things | junk
4        more junk
5            other
Name: Comments & Junk, dtype: object

Let me know in a comment if you encounter an issue.

  • Related