I have a dataframe where I need to insert dates for every key. Both come from two seperate dataframes with no columns to merge on. I assume it would maybe be through some for loop but not sure how it would look. Here are the two original dataframes:
df_this = pd.DataFrame({'scu_key': [4, 8, 10, 12]})
df_that = pd.DataFrame({'dates': ['02-2018', '03-2018', '04-2018', '05-2018']})
df_all_that = pd.DataFrame({'scu_key':[4, 4, 4, 4, 8, 8, 8, 8, 10, 10, 10, 10, 12, 12, 12, 12], 'dates': ['02-2018', '03-2018', '04-2018', '05-2018','02-2018', '03-2018', '04-2018', '05-2018','02-2018', '03-2018', '04-2018', '05-2018','02-2018', '03-2018', '04-2018', '05-2018']})
So as you can see there are repeat values for 'scu_key' because each scu_key will have each of the dates attached to it.
CodePudding user response:
I'm having trouble understanding what you are looking for.... to simply create a "copy" of a list of a dataframe would be as simple as going:
df_all_of_that['scu_keys_duplicate'] = df_all_of_that['scu_keys']
but from the data are you trying to make a loop that adds the new dates depending on the keys....?
CodePudding user response:
Use pd.MultiIndex.from_product
:
>>> pd.MultiIndex.from_product([df1['scu_key'], df2['dates']]) \
.to_frame().reset_index(drop=True)
scu_key dates
0 4 02-2018
1 4 03-2018
2 4 04-2018
3 4 05-2018
4 8 02-2018
5 8 03-2018
6 8 04-2018
7 8 05-2018
8 10 02-2018
9 10 03-2018
10 10 04-2018
11 10 05-2018
12 12 02-2018
13 12 03-2018
14 12 04-2018
15 12 05-2018
CodePudding user response:
Just use cross
for pd.merge:
df_this.merge(df_that, how='cross')
scu_key dates
0 4 02-2018
1 4 03-2018
2 4 04-2018
3 4 05-2018
4 8 02-2018
5 8 03-2018
6 8 04-2018
7 8 05-2018
8 10 02-2018
9 10 03-2018
10 10 04-2018
11 10 05-2018
12 12 02-2018
13 12 03-2018
14 12 04-2018
15 12 05-2018
An alternative is expand_grid from pyjanitor, which, depending on the array size, could be faster than pd.merge(cross)
:
# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
df_this.expand_grid(df_key='A', others={'B':df_that}).droplevel(0,1)
scu_key dates
0 4 02-2018
1 4 03-2018
2 4 04-2018
3 4 05-2018
4 8 02-2018
5 8 03-2018
6 8 04-2018
7 8 05-2018
8 10 02-2018
9 10 03-2018
10 10 04-2018
11 10 05-2018
12 12 02-2018
13 12 03-2018
14 12 04-2018
15 12 05-2018
For even more performance, this thread has some fast implementations.