I'm new with Dataframe. I would like to kwon how (if possible) can I merge 2 Dataframes with multiple match For example
[df1]
date ZipCode Weather
2022-11-25 00:00:00 123456 34
2022-11-25 00:00:15 123456 35
2022-11-25 00:00:30 123456 36
[df2]
date ZipCode host
2022-11-25 00:00:00 123456 host1
2022-11-25 00:00:00 123456 host2
2022-11-25 00:00:00 123456 host3
2022-11-25 00:00:15 123456 host1
2022-11-25 00:00:30 123456 host2
2022-11-25 00:00:30 123456 host3
Expected results:
date ZipCode host Weather
2022-11-25 00:00:00 123456 host1 34
2022-11-25 00:00:00 123456 host2 34
2022-11-25 00:00:00 123456 host3 34
2022-11-25 00:00:15 123456 host1 35
2022-11-25 00:00:30 123456 host2 36
2022-11-25 00:00:30 123456 host3 36
My objetive is assign weather measures to each host. I have weather measurements every 15 minutes for one ZipCode (One line) By the other hand, I have several host KPIs for one time and one ZipCode (multiples lines)
Can I perfomr this activity with Dataframes?
Thanks in advance!
CodePudding user response:
You could use the join function in pandas which joins one dataframe's index to the index of the other. Try something like
import pandas as pd
data1 = \
[['2022-11-25 00:00:00', 123456, 34],
['2022-11-25 00:00:15', 123456, 35],
['2022-11-25 00:00:30', 123456, 36]]
columns1 =['date', 'ZipCode', 'Weather']
data2 = \
[['2022-11-25 00:00:00', 123456, 'host1'],
['2022-11-25 00:00:00', 123456, 'host2'],
['2022-11-25 00:00:00', 123456, 'host3'],
['2022-11-25 00:00:15', 123456, 'host1'],
['2022-11-25 00:00:30', 123456, 'host2'],
['2022-11-25 00:00:30', 123456, 'host3']]
columns2 =['date', 'ZipCode', 'host']
df1 = pd.DataFrame(data=data1, columns=columns1)
df1.date = pd.to_datetime(df1.date)
df1.set_index('date', inplace=True)
df2 = pd.DataFrame(data=data2, columns=columns2)
df2.date = pd.to_datetime(df2.date)
df2.set_index('date', inplace=True)
df3 = df1.join(df2['host'], on='date')
df3
CodePudding user response:
We do that by using merge
and setting the argument on
to ['date', 'ZipCode']
:
new_df = pd.merge(df2, df1, on=['date', 'ZipCode'])
Output
>>> new_df
... date ZipCode host Weather
0 2022-11-25 00:00:00 123456 host1 34
1 2022-11-25 00:00:00 123456 host2 34
2 2022-11-25 00:00:00 123456 host3 34
3 2022-11-25 00:00:15 123456 host1 35
4 2022-11-25 00:00:30 123456 host2 36
5 2022-11-25 00:00:30 123456 host3 36