How to convert two dataframes (with same cols/rows) into a dataframe of tuples?

How can I convert two dataframes, e.g.:

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df2 = pd.DataFrame({'A': [11, 22, 33], 'B': [110, 220, 330]})

into

         A          B
0  (1, 11)  (10, 110)
1  (2, 22)  (20, 220)
2  (3, 33)  (30, 330)

I'm trying to find a pandas function instead of using a loop. This is just a dummy example; the original dataframes have many columns.

CodePudding user response：

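You can use DataFrame.join (shown here for the single-column case, where each row of the joined frame becomes one tuple):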

df1.join(df2, lsuffix='1', rsuffix='2').apply(tuple, axis=1).to_frame('A')
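
As a minimal sketch of the join approach above, assuming single-column frames like those used in the answers below: join aligns the two frames on the index, and apply(tuple, axis=1) turns each joined row into one tuple. With the two-column frames from the question, each tuple would instead hold all four joined values, so this fits the single-column case only.

```python
import pandas as pd

# Single-column frames (an assumption; the question's frames have two columns).
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [11, 22, 33]})

# join aligns on the index; the suffixes disambiguate the two 'A' columns.
joined = df1.join(df2, lsuffix='1', rsuffix='2')

# Each row (A1, A2) becomes one tuple; to_frame names the resulting column.
result = joined.apply(tuple, axis=1).to_frame('A')
print(result)
```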

CodePudding user response：

The fastest way is probably to zip the two columns directly:

df = pd.DataFrame({"A": zip(df1.A, df2.A)})

In the timings below, this is much faster than the other solutions, and simpler. The setup:

def repeat_df(df, n):
    return pd.concat([df]*n, ignore_index=True)

n = 1000

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [11, 22, 32]})

df1 = repeat_df(df1, n)
df2 = repeat_df(df2, n)

>>> %timeit df1.join(df2, lsuffix='1', rsuffix='2').apply(tuple, axis=1).to_frame('A')

36.6 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit pd.concat([df1, df2]).groupby(level=0).agg(tuple)

39.8 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit pd.DataFrame({"A": zip(df1.A, df2.A)})

1.95 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

EDIT

OP updated the example to work with multiple columns. The above solution can be easily generalized:

df = pd.DataFrame({col: zip(df1[col], df2[col]) for col in df1.columns})

This is still much faster than the groupby solution, assuming the same setup:

>>> %timeit pd.concat([df1, df2]).groupby(level=0).agg(tuple)

70.3 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit pd.DataFrame({col: zip(df1[col], df2[col]) for col in df1.columns})

3.41 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
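
For reference, the multi-column zip approach can be sketched on the frames from the question. The helper name pair_frames and its shape check are assumptions added for illustration, not part of the original answer; zip pairs values by position rather than by index label, so the check guards against silently mismatched inputs.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df2 = pd.DataFrame({'A': [11, 22, 33], 'B': [110, 220, 330]})


def pair_frames(left, right):
    """Hypothetical helper: pair the values of two frames cell by cell."""
    # zip works by position, so reject frames whose shape or columns differ.
    if list(left.columns) != list(right.columns) or len(left) != len(right):
        raise ValueError("frames must have the same columns and row count")
    # One column of tuples per original column; keep the left frame's index.
    return pd.DataFrame(
        {col: list(zip(left[col], right[col])) for col in left.columns},
        index=left.index,
    )


result = pair_frames(df1, df2)
print(result)
```

If the two frames may have differently ordered indexes, aligning them first (for example with right.reindex(left.index)) would be needed before pairing by position.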

CodePudding user response：

You can concat both frames, then use groupby.agg on the index. This method aligns columns by name and groups rows that share the same index label.

print(pd.concat([df1, df2]).groupby(level=0).agg(tuple)) 
         A
0  (1, 11)
1  (2, 22)
2  (3, 32)

That said, in this specific case, a list comprehension may be faster:

pd.DataFrame({'A':[(a1, a2) for a1, a2 in zip(df1['A'], df2['A'])]})
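
The alignment behaviour of the concat and groupby approach can be illustrated with a small sketch on the two-column frames from the question. The reordered copy of df2 (named shuffled here) is an assumption added to show that rows are matched by index label and columns by name, not by position.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df2 = pd.DataFrame({'A': [11, 22, 33], 'B': [110, 220, 330]})

# Reorder rows and columns of df2 to show that alignment is label-based.
shuffled = df2.loc[[2, 0, 1], ['B', 'A']]

# concat stacks the frames and aligns columns by name;
# groupby(level=0) then collects the rows that share an index label.
stacked = pd.concat([df1, shuffled])
result = stacked.groupby(level=0).agg(tuple)
print(result)
```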

CodePudding user response：

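You can use DataFrame.itertuples to iterate over the rows as namedtuples after concatenating the two dataframes. Note that this yields one tuple per row of the combined frame rather than pairing the values cell by cell.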

df = pd.concat([df1, df2])
tuples = df.itertuples()
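
To make clear what this last snippet produces, here is a short sketch using the frames from the question. It only inspects the iterator's output; it does not build the dataframe of paired tuples asked for.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df2 = pd.DataFrame({'A': [11, 22, 33], 'B': [110, 220, 330]})

stacked = pd.concat([df1, df2])

# itertuples returns an iterator of namedtuples, one per row,
# with the index label as the first field (named Index).
rows = list(stacked.itertuples())

first = rows[0]           # first row of df1
first_of_df2 = rows[3]    # first row of df2, which keeps index label 0
print(first.Index, first.A, first.B)
print(tuple(first_of_df2))
```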