Here is my data:
df:
id sub_id
A 1
A 2
B 3
B 4
and I have the following array (named arrays in the code below), with one row per row of df:
[[1,2],
[2,5],
[1,4],
[7,8]]
Here is my code:
from collections import defaultdict
sub_id_array_dict = defaultdict(dict)
for i, s, a in zip(df['id'].to_list(), df['sub_id'].to_list(), arrays):
    sub_id_array_dict[i][s] = a
Now, my actual dataframe includes a total of 100M rows (unique sub_id) with 500K unique ids. Ideally, I'd like to avoid a for loop.
Any help would be much appreciated.
CodePudding user response:
Assuming the arrays variable has the same number of rows as the DataFrame, assign it to a new column:
df['value'] = arrays
Then group by id and convert each group into a dictionary:
df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
Output
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
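For reference, here is a self-contained sketch of the grouping approach on the sample data from the question. It is a variant, not the exact call above: it iterates over the groups with a dict comprehension instead of using apply. It assumes arrays is a list of lists (or a 2-D NumPy array); wrapping it in list(...) is my own addition so that each row lands in a single cell either way.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"id": ["A", "A", "B", "B"], "sub_id": [1, 2, 3, 4]})
arrays = [[1, 2], [2, 5], [1, 4], [7, 8]]

# list(...) stores one row of `arrays` per cell, even for a 2-D NumPy array
df["value"] = list(arrays)

# Build one inner dict per id by iterating over the groups
result = {
    key: dict(zip(group["sub_id"], group["value"]))
    for key, group in df.groupby("id")
}
print(result)
```

This still loops once per id in Python (500K iterations for the real data) rather than once per row, which may or may not be fast enough at 100M rows.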
CodePudding user response:
You can assign arrays to a column and then use pivot:
df['value'] = arrays
out = df.pivot(index='sub_id', columns='id', values='value').to_dict()
Output:
{'A': {1: [1, 2], 2: [2, 5], 3: nan, 4: nan},
'B': {1: nan, 2: nan, 3: [1, 4], 4: [7, 8]}}
If you want to get rid of the NaNs:
new_out = {key: {k:v for k,v in val.items() if v is not np.nan} for key, val in out.items()}
Output:
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
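Putting the pivot approach together as a runnable sketch on the question's sample data. Two things here are my own choices rather than part of the answer: pivot is called with keyword arguments (recent pandas versions no longer accept them positionally), and the filter keeps only list values instead of comparing against np.nan by identity, which assumes every real payload is a list.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"id": ["A", "A", "B", "B"], "sub_id": [1, 2, 3, 4]})
arrays = [[1, 2], [2, 5], [1, 4], [7, 8]]
df["value"] = list(arrays)

# One column per id, one row per sub_id; missing combinations become NaN
out = df.pivot(index="sub_id", columns="id", values="value").to_dict()

# Drop the NaN placeholders by keeping only the list payloads
new_out = {
    key: {k: v for k, v in val.items() if isinstance(v, list)}
    for key, val in out.items()
}
print(new_out)
```

Keep in mind that pivot builds a cell for every (sub_id, id) combination, so with 100M sub_ids and 500K ids the intermediate table is likely far too large to hold in memory; the grouping approach avoids that.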