I want to merge 2 dataframes, with the resulting dataframe having a list in every single cell. I'm completely lost on how to do this.
My current solution uses the index of each dataframe to build a dict (e.g. dict[index[0]]['DEPTH'] = []), and then loops over the rows of the dataframes to append to the dict keys (e.g. dict[index[0]]['DEPTH'].append(cell_value)), but I suspect that's very inefficient and slow.
Does a pandas solution exist that would get this done?
- df1 would look like this:
- df2 would look like this:
- Resulting df would look something like this:
                  DEPTH       A
chr1~10007022~C  [1, 1]  [0, 0]
chr1~10007023~T  [1, 1]  [0, 0]
.
.
.
chr1~10076693~T  [1, 1]  [0, 0]
Keep in mind:
- The indexes of the two dataframes will probably differ, but not always.
- Each dataframe will probably contain >100M rows.
User response:
You could concatenate the two dataframes, group by the index (the item), and then aggregate each column with list.
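import pandas as pd

# First dataframe, indexed by item
df = pd.DataFrame({'item': ['chr1-10007022-C', 'chr1-10007023-T'],
                   'DEPTH': [1, 1],
                   'A': [0, 0],
                   'C': [0, 0]})
df = df.set_index('item')

# Second dataframe; only the first item also appears in df
df2 = pd.DataFrame({'item': ['chr1-10007022-C', 'chr1-10007026-X'],
                    'DEPTH': [1, 1],
                    'A': [0, 0],
                    'C': [0, 0]})
df2 = df2.set_index('item')

# Stack the rows, then collect each column's values per index label into a list
out = pd.concat([df, df2]).groupby(level=0).agg(list)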
Output:
                  DEPTH       A       C
item
chr1-10007022-C  [1, 1]  [0, 0]  [0, 0]
chr1-10007023-T     [1]     [0]     [0]
chr1-10007026-X     [1]     [0]     [0]
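As the output above shows, an item present in only one dataframe ends up with a one-element list. The same approach can be wrapped in a small helper, sketched below. The name merge_to_lists and the sample values are made up for illustration; distinct values are used so the order inside each list is visible (values from the first dataframe come first).

```python
import pandas as pd


def merge_to_lists(*frames):
    """Stack frames that share columns, then gather each column's
    values per index label into a list (hypothetical helper)."""
    stacked = pd.concat(frames)
    # groupby(level=0) groups rows by index label; agg(list) turns each
    # group's column values into a Python list, keeping concat order.
    return stacked.groupby(level=0).agg(list)


# Illustrative data only: the index labels overlap on one item.
df1 = pd.DataFrame({"DEPTH": [1, 1], "A": [0, 0]},
                   index=["chr1-10007022-C", "chr1-10007023-T"])
df2 = pd.DataFrame({"DEPTH": [2, 3], "A": [5, 6]},
                   index=["chr1-10007022-C", "chr1-10007026-X"])

out = merge_to_lists(df1, df2)
print(out)
```

Note that every cell here is a Python list stored in an object column, so with >100M rows per dataframe this is likely to be memory-heavy; it may be worth trying it on a subset first.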