How to create duplicates for every value in a dataframe


I have a dataframe where I need to insert dates for every key. Both come from two separate dataframes with no columns to merge on. I assume it could be done with some kind of for loop, but I'm not sure how it would look. Here are the two original dataframes, followed by the result I'm after:

import pandas as pd

df_this = pd.DataFrame({'scu_key': [4, 8, 10, 12]})

df_that = pd.DataFrame({'dates': ['02-2018', '03-2018', '04-2018', '05-2018']})

df_all_that = pd.DataFrame({
    'scu_key': [4, 4, 4, 4, 8, 8, 8, 8, 10, 10, 10, 10, 12, 12, 12, 12],
    'dates': ['02-2018', '03-2018', '04-2018', '05-2018',
              '02-2018', '03-2018', '04-2018', '05-2018',
              '02-2018', '03-2018', '04-2018', '05-2018',
              '02-2018', '03-2018', '04-2018', '05-2018'],
})

As you can see, the values in 'scu_key' repeat because each scu_key gets every one of the dates attached to it.

CodePudding user response:

I'm having trouble understanding what you are looking for. If you simply want a "copy" of a column in a dataframe, that is as simple as:

df_all_that['scu_key_duplicate'] = df_all_that['scu_key']

But judging from your data, are you trying to make a loop that adds the dates for each key?
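
If a plain loop is what you had in mind, here is a minimal sketch (assuming the df_this and df_that from the question) that builds one row per (key, date) pair and only constructs the frame at the end:

import pandas as pd

df_this = pd.DataFrame({'scu_key': [4, 8, 10, 12]})
df_that = pd.DataFrame({'dates': ['02-2018', '03-2018', '04-2018', '05-2018']})

# one (scu_key, dates) tuple for every combination
rows = [(key, date)
        for key in df_this['scu_key']
        for date in df_that['dates']]

df_all_that = pd.DataFrame(rows, columns=['scu_key', 'dates'])

The vectorised answers below will be faster on larger inputs, but this is the direct translation of the "for loop" idea.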

CodePudding user response:

Use pd.MultiIndex.from_product:

>>> pd.MultiIndex.from_product([df_this['scu_key'], df_that['dates']]) \
                 .to_frame().reset_index(drop=True)

    scu_key    dates
0         4  02-2018
1         4  03-2018
2         4  04-2018
3         4  05-2018
4         8  02-2018
5         8  03-2018
6         8  04-2018
7         8  05-2018
8        10  02-2018
9        10  03-2018
10       10  04-2018
11       10  05-2018
12       12  02-2018
13       12  03-2018
14       12  04-2018
15       12  05-2018
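
The columns come out named scu_key and dates because from_product picks up the names of the input Series. If you were starting from plain lists instead, you could pass the names explicitly (a small sketch with the same data):

pd.MultiIndex.from_product(
    [[4, 8, 10, 12],
     ['02-2018', '03-2018', '04-2018', '05-2018']],
    names=['scu_key', 'dates']
).to_frame().reset_index(drop=True)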

CodePudding user response:

Just use how='cross' with pd.merge:

df_this.merge(df_that, how='cross')
    scu_key    dates
0         4  02-2018
1         4  03-2018
2         4  04-2018
3         4  05-2018
4         8  02-2018
5         8  03-2018
6         8  04-2018
7         8  05-2018
8        10  02-2018
9        10  03-2018
10       10  04-2018
11       10  05-2018
12       12  02-2018
13       12  03-2018
14       12  04-2018
15       12  05-2018
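
Note that how='cross' needs pandas 1.2 or newer. On older versions a common workaround (a sketch, not part of the original answer) is to merge on a temporary constant key:

# df_this / df_that as defined in the question
(df_this.assign(_key=1)
        .merge(df_that.assign(_key=1), on='_key')
        .drop(columns='_key'))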

An alternative is expand_grid from pyjanitor, which, depending on the array size, can be faster than a cross merge:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
df_this.expand_grid(df_key='A', others={'B': df_that}).droplevel(0, axis=1)
 
    scu_key    dates
0         4  02-2018
1         4  03-2018
2         4  04-2018
3         4  05-2018
4         8  02-2018
5         8  03-2018
6         8  04-2018
7         8  05-2018
8        10  02-2018
9        10  03-2018
10       10  04-2018
11       10  05-2018
12       12  02-2018
13       12  03-2018
14       12  04-2018
15       12  05-2018

For even more performance, this thread has some fast implementations.
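
One typical fast approach along those lines (a sketch using NumPy, not taken from the linked thread) repeats the keys and tiles the dates, then builds the frame in one go:

import numpy as np

# df_this / df_that as defined in the question
df_all_that = pd.DataFrame({
    'scu_key': np.repeat(df_this['scu_key'].to_numpy(), len(df_that)),
    'dates': np.tile(df_that['dates'].to_numpy(), len(df_this)),
})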
