Home > Enterprise >  Concat 2 dataframes and drop duplicates consider the date and time to replace the NaN values by da
Concat 2 dataframes and drop duplicates consider the date and time to replace the NaN values by da

Time:04-20

Good morning,

I'm working on this code :

df1 = pd.DataFrame()
df1['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "01/08/2021", "01/08/2021", "02/08/2021", "02/08/2021"]
df1['Time'] = ["06:48:00", "06:48:00", "06:56:00", "06:56:00", "07:14:00", "07:14:00", "07:40:00", "07:40:00", "08:42:00", "08:42:00", "08:52:00", "08:52:00", "09:07:00", "09:07:00"]
df1["Column1"] = ['NaN', 'NaN', 0.038807581, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854, 0.000383506, 0.000543031, 0.000342, 'NaN', 'NaN']
df1["Column2"] = [0.000270475, 0.000313769,  'NaN', 'NaN', 0.000483506, 0.000643031,  0.000533131,  0.000543031,  0.000342, 0.056263517, 0.042163517, 0.035163517, 0.025163517, 0.026363517]
df2 = pd.DataFrame()
df2['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "01/08/2021", "01/08/2021", "02/08/2021", "02/08/2021"]
df2['Time'] = ["06:48:00", "06:48:00", "06:56:00", "06:56:00", "07:14:00", "07:14:00", "07:40:00", "07:40:00", "08:42:00", "08:42:00", "08:52:00", "08:52:00", "09:07:00", "09:07:00"]
df2["Column1"] = [0.041807581, 0.019607581, 'NaN', 'NaN', 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854, 0.000383506, 'NaN', 'NaN', 0.000313769, 0.000413769]
df2["Column2"] = [0.000270475, 0.000313769,  0.000383506,  0.000583506, 'NaN', 'NaN',  0.000533131,  0.000543031, 'NaN', 'NaN', 0.042163517, 0.035163517, 0.025163517, 0.026363517]
diff_df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)

The output is like this :

 ------------ ---------- ------------- ------------- 
|    Date    |   Time   |   Column1   |   Column2   |
 ------------ ---------- ------------- ------------- 
| 29/07/2021 | 06:48:00 | NaN         | 0,000270475 |
| 29/07/2021 | 06:48:00 | NaN         | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,038807581 | NaN         |
| 29/07/2021 | 06:56:00 | 0,018807581 | NaN         |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,000483506 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,000643031 |
| 30/07/2021 | 07:40:00 | 0,026561283 | 0,000533131 |
| 30/07/2021 | 07:40:00 | 0,027743659 | 0,000543031 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,000342    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,056263517 |
| 01/08/2021 | 08:52:00 | 0,000543031 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,000342    | 0,035163517 |
| 02/08/2021 | 09:07:00 | NaN         | 0,025163517 |
| 02/08/2021 | 09:07:00 | NaN         | 0,026363517 |
| 29/07/2021 | 06:48:00 | 0,041807581 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,019607581 | 0,000313769 |
| 29/07/2021 | 06:56:00 | NaN         | 0,000383506 |
| 29/07/2021 | 06:56:00 | NaN         | 0,000583506 |
| 30/07/2021 | 07:14:00 | 0,025931434 | NaN         |
| 30/07/2021 | 07:14:00 | 0,025163517 | NaN         |
| 31/07/2021 | 08:42:00 | 0,028854    | NaN         |
| 31/07/2021 | 08:42:00 | 0,000383506 | NaN         |
| 01/08/2021 | 08:52:00 | NaN         | 0,042163517 |
| 01/08/2021 | 08:52:00 | NaN         | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,000313769 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,000413769 | 0,026363517 |
 ------------ ---------- ------------- ------------- 

What i need is that if there is NaN value for one of the dataframes, it will consider to use the other one value considering the fact that they have the same date and time. which is something like this :

 ------------ ---------- ------------- ------------- 
|    Date    |   Time   |   Column1   |   Column2   |
 ------------ ---------- ------------- ------------- 
| 29/07/2021 | 06:48:00 | 0,041807581 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,019607581 | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,038807581 | 0,000383506 |
| 29/07/2021 | 06:56:00 | 0,018807581 | 0,000583506 |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,000483506 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,000643031 |
| 30/07/2021 | 07:40:00 | 0,026561283 | 0,000533131 |
| 30/07/2021 | 07:40:00 | 0,027743659 | 0,000543031 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,000342    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,056263517 |
| 01/08/2021 | 08:52:00 | 0,000543031 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,000342    | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,000313769 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,000413769 | 0,026363517 |
 ------------ ---------- ------------- ------------- 

Thank you for your time and have a great day !

EDIT######## Using these lines :

diff_df.Column1 = diff_df.Column1.fillna(diff_df.Column2)
diff_df.Column2 = diff_df.Column2.fillna(diff_df.Column1)

Will give me this output which is not what i needed :

 ------------ ---------- ------------- ------------- 
|    Date    |   Time   |   Column1   |   Column2   |
 ------------ ---------- ------------- ------------- 
| 29/07/2021 | 06:48:00 | 0,000270475 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,000313769 | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,038807581 | 0,038807581 |
| 29/07/2021 | 06:56:00 | 0,018807581 | 0,018807581 |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,000483506 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,000643031 |
| 30/07/2021 | 07:40:00 | 0,026561283 | 0,000533131 |
| 30/07/2021 | 07:40:00 | 0,027743659 | 0,000543031 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,000342    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,056263517 |
| 01/08/2021 | 08:52:00 | 0,000543031 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,000342    | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,025163517 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,026363517 | 0,026363517 |
| 29/07/2021 | 06:48:00 | 0,041807581 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,019607581 | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,000383506 | 0,000383506 |
| 29/07/2021 | 06:56:00 | 0,000583506 | 0,000583506 |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,025931434 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,025163517 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,028854    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,000383506 |
| 01/08/2021 | 08:52:00 | 0,042163517 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,035163517 | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,000313769 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,000413769 | 0,026363517 |
 ------------ ---------- ------------- ------------- 

CodePudding user response:

Use DataFrame.fillna with convert values to index if possible different datetimes:

df1 = df1.replace('NaN', np.nan)
df2 = df2.replace('NaN', np.nan)

df = df1.set_index(['Date','Time']).fillna(df2.set_index(['Date','Time'])).reset_index()
print (df)

          Date      Time   Column1   Column2
0   29/07/2021  06:48:00  0.041808  0.000270
1   29/07/2021  06:48:00  0.019608  0.000314
2   29/07/2021  06:56:00  0.038808  0.000384
3   29/07/2021  06:56:00  0.018808  0.000584
4   30/07/2021  07:14:00  0.025931  0.000484
5   30/07/2021  07:14:00  0.025164  0.000643
6   30/07/2021  07:40:00  0.026561  0.000533
7   30/07/2021  07:40:00  0.027744  0.000543
8   31/07/2021  08:42:00  0.028854  0.000342
9   31/07/2021  08:42:00  0.000384  0.056264
10  01/08/2021  08:52:00  0.000543  0.042164
11  01/08/2021  08:52:00  0.000342  0.035164
12  02/08/2021  09:07:00  0.000314  0.025164
13  02/08/2021  09:07:00  0.000414  0.026364

If always same index and same rows between both DataFrames:

df1 = df1.replace('NaN', np.nan)
df2 = df2.replace('NaN', np.nan)

df = df1.fillna(df2)
print (df)
          Date      Time   Column1   Column2
0   29/07/2021  06:48:00  0.041808  0.000270
1   29/07/2021  06:48:00  0.019608  0.000314
2   29/07/2021  06:56:00  0.038808  0.000384
3   29/07/2021  06:56:00  0.018808  0.000584
4   30/07/2021  07:14:00  0.025931  0.000484
5   30/07/2021  07:14:00  0.025164  0.000643
6   30/07/2021  07:40:00  0.026561  0.000533
7   30/07/2021  07:40:00  0.027744  0.000543
8   31/07/2021  08:42:00  0.028854  0.000342
9   31/07/2021  08:42:00  0.000384  0.056264
10  01/08/2021  08:52:00  0.000543  0.042164
11  01/08/2021  08:52:00  0.000342  0.035164
12  02/08/2021  09:07:00  0.000314  0.025164
13  02/08/2021  09:07:00  0.000414  0.026364
  • Related