Home > Blockchain >  How can I efficiently perform the cartesian product of two pandas dataframes in Python?
How can I efficiently perform the cartesian product of two pandas dataframes in Python?

Time:12-22

So my goal is to expand my monthly data into daily data and I thought the best way would be to merge the two dataframes below. my pandas is pre 1.2 so I cant use merge with cross. Is there another efficient way?

I have two dataframes. One dataframe is a single column with times like:

A = 2000-01-01,2000-01-02....2000-12-31

and my second dataframe has a couple different columns, say time, city, food. I want to merge my second dataframe with the first on time but time has the format

time = 2000-01-01,2000-02-01...2000-12-01

for each city it has. To visualise its like

London 2000-01-01  apple
London 2000-01-01  orange
London 2000-01-01  Rasberry
London 2000-02-01  apple
London 2000-02-01  orange
London 2000-02-01  Rasberry
...
London 2000-12-01  Rasberry
Paris 2000-01-01  apple
Paris 2000-01-01  orange
Paris 2000-01-01  Rasberry
Paris 2000-02-01  apple
Paris 2000-02-01  orange
Paris 2000-02-01  Rasberry
...
Paris  2000-12-01 Rasberry

and I want to merge it so it ends up as:

London 2000-01-01 apple
London 2000-01-02 NAN
..
London 2000-12-01 Apple
..
London 2000-12-31 NAN
London 2000-01-01 Orange
London 2000-01-02 NAN
..
London 2000-12-01 Orange
..
London 2000-12-31 NAN
Paris 2000-01-01 apple
Paris 2000-01-02 NAN
..
Paris 2000-12-01 Apple
..
Paris 2000-12-31 NAN
Paris 2000-01-01 Orange
Paris 2000-01-02 NAN
..
Paris 2000-12-01 Orange
..
Paris 2000-12-31 NAN

The NANS being from when I merge it, there are no values from the 2nd-31st of each city of each fruit. If there is a simple way to do this please let me know. and its in year month day format. I believed it to be some sort of outer join but it has messed up the format.

CodePudding user response:

I would use a groupby followed with a left merge:

df2.groupby('city').apply(lambda df: df1.merge(df, 'left', on='time')
                          )[['time', 'food']].reset_index(
                              level='city').reset_index(drop=True)

I got:

       city        time      food
0    London  2000-01-01     apple
1    London  2000-01-01    orange
2    London  2000-01-01  Rasberry
3    London  2000-01-02       NaN
4    London  2000-01-03       NaN
5    London  2000-01-04       NaN
...
30   London  2000-01-29       NaN
31   London  2000-01-30       NaN
32   London  2000-01-31       NaN
33   London  2000-02-01     apple
34   London  2000-02-01    orange
35   London  2000-02-01  Rasberry
36   London  2000-02-02       NaN
37   London  2000-02-03       NaN
...
337  London  2000-11-29       NaN
338  London  2000-11-30       NaN
339  London  2000-12-01  Rasberry
340  London  2000-12-02       NaN
341  London  2000-12-03       NaN
342  London  2000-12-04       NaN
...
367  London  2000-12-29       NaN
368  London  2000-12-30       NaN
369  London  2000-12-31       NaN
370   Paris  2000-01-01     apple
371   Paris  2000-01-01    orange
372   Paris  2000-01-01  Rasberry
373   Paris  2000-01-02       NaN
374   Paris  2000-01-03       NaN
...
401   Paris  2000-01-30       NaN
402   Paris  2000-01-31       NaN
403   Paris  2000-02-01     apple
404   Paris  2000-02-01    orange
405   Paris  2000-02-01  Rasberry
406   Paris  2000-02-02       NaN
407   Paris  2000-02-03       NaN
...
707   Paris  2000-11-29       NaN
708   Paris  2000-11-30       NaN
709   Paris  2000-12-01  Rasberry
710   Paris  2000-12-02       NaN
711   Paris  2000-12-03       NaN
...
738   Paris  2000-12-30       NaN
739   Paris  2000-12-31       NaN
  • Related