I currently have 2 lists of tuples, both of which have the same information on the first element of the tuple. I'm trying to see if there is a way to "Join" these two tuple lists in a dataframe based on their common elements. Something like an SQL JOIN on a common column.
Lists are something like this:
listA = [(0, A), (1, B), (2, C)]
listB = [(0, G), (1, H), (2, I)]
and what I'm trying to achieve is a dataframe that looks something like this:
Col1 Col2 Col3
0 A G
1 B H
2 C I
Ideally, I dont want to "extract" the first element of a single list into a separete one and use that as the first column. I really want this "Join on the common column" functionality. The reason being that Im not sure that the tuples match on every single place and so I would like that be cared for automatically (like in SQL).
CodePudding user response:
Try this,
Code:
import pandas as pd
l1 = [(0, 'A'), (1, 'B'), (2, 'C')]
l2 = [(0, 'G'), (1, 'H'), (2, 'I')]
ur_lists = [l1, l2]
list_of_dfs = [pd.DataFrame(data, columns=['key', f'col{idx}'])
for idx, data in enumerate(ur_lists)]
dfs = [df.set_index('key') for df in list_of_dfs]
pd.concat(dfs, axis=1).reset_index()
Output:
key col0 col1
0 A G
1 B H
2 C I
CodePudding user response:
TRY THIS:
import pandas as pd
listA = [(0, 'A'), (1, 'B'), (2, 'C')]
listB = [(0, 'G'), (1, 'H'), (2, 'I')]
df = pd.DataFrame({"COL": [x[0] for x in listA], "COL2": [x[1] for x in listA]})
df2 = pd.DataFrame({"COL": [x[0] for x in listB], "COL3": [x[1] for x in listB]})
print(df)
print(df2)
t = df.join(df2.set_index('COL'), on='COL')
print(t)
CodePudding user response:
Instead of building multiple DataFrames and concatenating, you can build a nested dictionary and build a single DataFrame with it:
df = pd.DataFrame({f"Col{i}": dict(c) for i,c in enumerate([listA, listB], 2)}).rename_axis('Col1').reset_index()
Output:
Col1 Col2 Col3
0 0 A G
1 1 B H
2 2 C I
This will be faster. The difference will be even more stark if you have many lists:
>>> ur_lists = [listA, listB]
>>> %timeit df = pd.DataFrame({f"Col{i}": dict(c) for i,c in enumerate(ur_lists, 2)}).rename_axis('Col1').reset_index()
1.05 ms ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit list_of_dfs = [pd.DataFrame(data, columns=['key', f'col{idx}']) for idx, data in enumerate(ur_lists)]; dfs = [df.set_index('key') for df in list_of_dfs]; df = pd.concat(dfs, axis=1).reset_index()
2.76 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)