Load CSV into pandas and convert to JSON hierarchy based on column values

Time:01-11

I have a CSV with hundreds of thousands of rows, but it basically looks like this:

personal_id location_type location_number
1 'company' 123
2 'branch' 321
1 'branch' 456
1 'branch' 567

The goal is to group everything by personal_id and, beneath each one, have a list of location_number values for each location_type:

[
    {
        "personal_id": 1,
        "company": [123],
        "branch": [456, 567]
    },
    {
        "personal_id": 2,
        "branch": [321]
    }
]

I used Python pandas because I've done something similar successfully before, but only at one filtering level, where pandas' to_dict('records') worked perfectly.

I've been trying to do something along those lines, such as this:

merge_df= (data_df.groupby(['personal_id'])
    .apply(lambda x: x[['regulator', 'employee_number', 'sex', 'status']]
        .to_dict('records'))
    .reset_index()
    .rename(columns={0: 'employee'}))

but I'm not able to figure out how to add an additional filter inside the apply(). This method also creates a column, which I renamed to 'employee', that I don't need in the above scenario.

My only other option, if pandas was the wrong choice, is to start everything over in C# with, say, CsvHelper and maybe AutoMapper.

CodePudding user response:

Try:

df = df.pivot_table(
    index="personal_id", columns="location_type", values="location_number", aggfunc=list
)

# keep only the non-NaN cells of each row, so missing location types are omitted
out = [row[row.notna()].to_dict() for _, row in df.reset_index().iterrows()]
print(out)

Prints:

[
    {"personal_id": 1, "branch": [456, 567], "company": [123]},
    {"personal_id": 2, "branch": [321]},
]
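
For anyone who wants to run this end to end, here is a minimal sketch. It assumes the file is comma-separated with the header row shown in the question; the inline csv_text is only a stand-in for the real file, the path "your_file.csv" in the comment is hypothetical, and the json.dumps step at the end is an addition that is not part of the answer above.

```python
import io
import json

import pandas as pd

# Assumption: the real file is comma-separated with this header row.
# csv_text stands in for it; with a real file you would call
# pd.read_csv("your_file.csv") instead (hypothetical path).
csv_text = """personal_id,location_type,location_number
1,company,123
2,branch,321
1,branch,456
1,branch,567
"""
df = pd.read_csv(io.StringIO(csv_text))

# One row per personal_id, one column per location_type, each cell a list
# (or NaN when that person has no rows of that type).
wide = df.pivot_table(
    index="personal_id",
    columns="location_type",
    values="location_number",
    aggfunc=list,
)

# Drop the NaN cells so absent location types are left out of each record.
out = [row[row.notna()].to_dict() for _, row in wide.reset_index().iterrows()]

# default=int is a safety net for any NumPy integer json cannot encode itself.
print(json.dumps(out, indent=4, default=int))
```

If the JSON should go to a file rather than stdout, json.dump with an open file handle takes the same arguments.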

CodePudding user response:

You can do this:

# groupby personal_id and then in apply groupby and aggregate by list.
s = df.groupby("personal_id").apply(
    lambda x: x.groupby("location_type")["location_number"].agg(list).to_dict()
)
# then construct dict from series
out = [{"personal_id": idx, **v} for idx, v in zip(s.index, s)]

print(out)

Prints:

[
    {"personal_id": "1", "branch": [456, 567], "company": [123]},
    {"personal_id": "2", "branch": [321]},
]
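
With hundreds of thousands of rows, a nested groupby inside apply may get slow, since the inner groupby runs once per personal_id. A possible alternative, not taken from either answer, is a single groupby over both keys followed by a plain Python loop. This is only a sketch under the same assumptions as the question's sample (comma-separated file, those three column names); csv_text and records are illustrative names.

```python
import io

import pandas as pd

# Assumption: stand-in for the real CSV, using the question's sample rows.
csv_text = """personal_id,location_type,location_number
1,company,123
2,branch,321
1,branch,456
1,branch,567
"""
df = pd.read_csv(io.StringIO(csv_text))

# A single groupby over both keys gives a Series of lists indexed by
# (personal_id, location_type).
lists = df.groupby(["personal_id", "location_type"])["location_number"].agg(list)

# Fold the (personal_id, location_type) pairs into one dict per personal_id.
records = {}
for (pid, loc_type), numbers in lists.items():
    rec = records.setdefault(pid, {"personal_id": int(pid)})
    rec[loc_type] = numbers

out = list(records.values())
print(out)
```

The int(pid) call assumes personal_id is numeric, as in the sample; drop it if the IDs are strings. I have not benchmarked this against the versions above, so treat the speed argument as a guess to verify on your own data.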