Create Pandas DataFrame from 2 tuple lists with common first elements-CodePudding

I currently have 2 lists of tuples, both of which have the same information on the first element of the tuple. I'm trying to see if there is a way to "Join" these two tuple lists in a dataframe based on their common elements. Something like an SQL JOIN on a common column.

Lists are something like this:

listA = [(0, A), (1, B), (2, C)]
listB = [(0, G), (1, H), (2, I)]

and what I'm trying to achieve is a dataframe that looks something like this:

Col1  Col2  Col3
 0      A     G
 1      B     H
 2      C     I

Ideally, I dont want to "extract" the first element of a single list into a separete one and use that as the first column. I really want this "Join on the common column" functionality. The reason being that Im not sure that the tuples match on every single place and so I would like that be cared for automatically (like in SQL).

CodePudding user response：

Try this,

Code:

import pandas as pd

l1 = [(0, 'A'), (1, 'B'), (2, 'C')]
l2 = [(0, 'G'), (1, 'H'), (2, 'I')]
ur_lists = [l1, l2]

list_of_dfs = [pd.DataFrame(data, columns=['key', f'col{idx}']) 
               for idx, data in enumerate(ur_lists)]
dfs = [df.set_index('key') for df in list_of_dfs]
pd.concat(dfs, axis=1).reset_index()

Output:

   key col0 col1    
    0   A   G
    1   B   H
    2   C   I

CodePudding user response：

TRY THIS:

import pandas as pd
listA = [(0, 'A'), (1, 'B'), (2, 'C')]
listB = [(0, 'G'), (1, 'H'), (2, 'I')]

df = pd.DataFrame({"COL": [x[0] for x in listA], "COL2": [x[1] for x in listA]})
df2 = pd.DataFrame({"COL": [x[0] for x in listB], "COL3": [x[1] for x in listB]})

print(df)
print(df2)

t = df.join(df2.set_index('COL'), on='COL')

print(t)

CodePudding user response：

Instead of building multiple DataFrames and concatenating, you can build a nested dictionary and build a single DataFrame with it:

df = pd.DataFrame({f"Col{i}": dict(c) for i,c in enumerate([listA, listB], 2)}).rename_axis('Col1').reset_index()

Output:

   Col1 Col2 Col3
0     0    A    G
1     1    B    H
2     2    C    I

This will be faster. The difference will be even more stark if you have many lists:

>>> ur_lists = [listA, listB]
>>> %timeit df = pd.DataFrame({f"Col{i}": dict(c) for i,c in enumerate(ur_lists, 2)}).rename_axis('Col1').reset_index()
1.05 ms ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit list_of_dfs = [pd.DataFrame(data, columns=['key', f'col{idx}']) for idx, data in enumerate(ur_lists)]; dfs = [df.set_index('key') for df in list_of_dfs]; df = pd.concat(dfs, axis=1).reset_index()
2.76 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)