Home > Software design >  Combine two datasets based on a single column time stamps which are different to each other
Combine two datasets based on a single column time stamps which are different to each other

Time:01-18

I want to combine two dataset according to similar timestamps. Both datasets have different stamps format as follow:

DF1

d1 = {'col1': ["a1-20220517-183028.spc", "a2-20220518-183028.spc","a3-20220519-183028.spc","a4-20220525-183028.spc"], 
      'col2': [a, b ,c,d]}
df_1 = pd.DataFrame(data=d1)

DF2

d2 = {'col1.2': ["18-05-2022 12:23","17-05-2022 0:16:00" ,"25-05-2022 20:23","19-05-2022 14:54","02-05-2022 12:14"], 
      'col2.2': [12, 2 ,3,10, 1]}
df_2 = pd.DataFrame(data=d2)

I want to combine df_1 and df_2 according to the timestamp of col1 in DF1 AND col1.2 in DF2, so the resulting output is given as follow:

d_output = {'col1': ["17-05-2022", "18-05-2022", "19-05-2022", "25-05-2022"], 
            'col2': [a, b ,c , d], 
            'col3': [2, 12 ,10, 3 ]}

df_output=pd.DataFrame(data=d)

I have tried extracting the timestamp of one and formatting it into the same format but I am really struggling to format it correctly and merge both datasets with a similar timestamp.

Anyway I could solve this?

Thanks a lot!

CodePudding user response:

This is a merge with a bit of pre-processing using str.extract and pd.to_datetime:

out = (
 df_1.assign(col1=pd.to_datetime(df_1['col1'].str.extract('-(\d{8})', expand=False))
                    .dt.strftime('%d-%m-%Y'))
     .merge(
  df_2.rename(columns={'col1.2': 'col1', 'col2.2': 'col3'})
      .assign(col1=lambda d: pd.to_datetime(d['col1']).dt.strftime('%d-%m-%Y')),
         how='left', on='col1'
     )
)

Output:

         col1 col2  col3
0  17-05-2022    a     2
1  18-05-2022    b    12
2  19-05-2022    c    10
3  25-05-2022    d     3

CodePudding user response:

You can use the pd.to_datetime function to convert the timestamps in both df_1 and df_2 to a common format, then use the pd.merge function to join the two dataframes on the common timestamp column.

For example like this:

df_1['timestamp'] = pd.to_datetime(df_1['col1'].str.extract(r'(\d{4}\d{2}\d{2}-\d{6})')[0], format='%Y%m%d-%H%M%S')

df_2['timestamp'] = pd.to_datetime(df_2['col1.2'], format='%d-%m-%Y %H:%M:%S')

df_output = pd.merge(df_1, df_2, on='timestamp')
  • Related