I have a CSV with thousands of hundreds of thousands of rows but basically looks like this
personal_id | location_type | location_number |
---|---|---|
1 | 'company' | 123 |
2 | 'branch | 321 |
1 | 'branch | 456 |
1 | 'branch | 567 |
The goal is to group everything by personal_id
and beneath that have 2 lists of the location_number
that are identified by the location_type
[
{
"personal_id": 1,
"company": [123],
"branch": [456, 567]
},
{
"personal_id": 2,
"branch": [321]
}
]
I used python pandas because i've done something successful before but only at 1 filtering level and using pandas to_dict('records)
worked perfectly at the time
ive been trying to do something in that light such as this
merge_df= (data_df.groupby(['personal_id'])
.apply(lambda x: x[['regulator', 'employee_number', 'sex', 'status']]
.to_dict('records'))
.reset_index()
.rename(columns={0: 'employee'}))
but im not able to figure out how to add an additional filter inside the apply()
as well as this method creates a column which I dont need in the above scenario that I renamed to 'employee'
My only other option is to start everything over in C# with say CSVHelper and maybe automapper if pandas was the wrong choice
CodePudding user response:
Try:
df = df.pivot_table(
index="personal_id", columns="location_type", values="location_number", aggfunc=list
)
out = [out.append(row[row.notna()].to_dict()) for _, row in df.reset_index().iterrows()]
print(out)
Prints:
[
{"personal_id": 1, "branch": [456, 567], "company": [123]},
{"personal_id": 2, "branch": [321]},
]
CodePudding user response:
You can do this:
# groupby personal_id and then in apply groupby and aggregate by list.
s = df.groupby("personal_id").apply(
lambda x: x.groupby("location_type")["location_number"].agg(list).to_dict()
)
# then construct dict from series
out = [{**{"personal_dict": idx}, **v} for idx, v in zip(s.index, s)]
print(out)
[
{"personal_dict": "1", "branch": [456, 567], "company": [123]},
{"personal_dict": "2", "branch": [321]},
]