I have a DataFrame made up of the following fields:
{
"cohort_key": datetime,
"last_name": str,
"file_identifier": str,
"file_status": str,
"responsible_label": str,
"status": str,
"earliest_appointment_date_key": datetime,
"latest_appointment_date_key_as_of_today": datetime,
"months_between_first_last_apppointment": int,
"count_appointments": int,
"count_hours": float,
"procedures": List[{"procedure_key": str, "unit_count": int, "charges_amount_laboratory": float}]
}
with "procedures" being a sparse column in which all dicts share the same format (obtained from ARRAY_AGG(STRUCT<...>)
in SQL). Here is some example data:
[
{
"cohort_key": 1475280000000,
"last_name": "Georges",
"file_identifier": "5672",
"file_status": "open",
"responsible_label": "Dr John Doe",
"status": "probably_done",
"earliest_appointment_date_key": 1475452800000,
"latest_appointment_date_key_as_of_today": 1544054400000,
"months_between_first_last_apppointment": 26,
"count_appointments": 16,
"count_hours": 15.5,
"procedures": [
{
"procedure_key": "84000",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "89600",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "80000",
"unit_count": 1,
"charges_amount_laboratory": 0.0
}
],
"avg_charges_gross_amount_per_appointment": 464.06,
"avg_charges_gross_amount_per_hour": 479.03,
"charges_gross_amount_total": 7425.0,
"charges_amount_laboratory": 0.0
},
{
"cohort_key": 1475280000000,
"last_name": "White",
"file_identifier": "6690",
"file_status": "open",
"responsible_label": "Dr John Doe",
"status": "probably_done",
"earliest_appointment_date_key": 1480291200000,
"latest_appointment_date_key_as_of_today": 1579737600000,
"months_between_first_last_apppointment": 37,
"count_appointments": 13,
"count_hours": 10.6666666667,
"procedures": [
{
"procedure_key": "84000",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "80000",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "80000",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "80000",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "80672",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "80672",
"unit_count": 1,
"charges_amount_laboratory": 0.0
},
{
"procedure_key": "83200",
"unit_count": 1,
"charges_amount_laboratory": 83.0
}
],
"avg_charges_gross_amount_per_appointment": 558.92,
"avg_charges_gross_amount_per_hour": 681.19,
"charges_gross_amount_total": 7266.0,
"charges_amount_laboratory": 83.0
},
{
"cohort_key": 1475280000000,
"last_name": "Nguyen",
"file_identifier": "6568",
"file_status": "open",
"responsible_label": "Dr John Doe",
"status": "probably_done",
"earliest_appointment_date_key": 1479081600000,
"latest_appointment_date_key_as_of_today": 1479081600000,
"months_between_first_last_apppointment": 0,
"count_appointments": 1,
"count_hours": 0.5,
"procedures": [],
"avg_charges_gross_amount_per_appointment": 0.0,
"avg_charges_gross_amount_per_hour": 0.0,
"charges_gross_amount_total": 0.0,
"charges_amount_laboratory": 0.0
},
{
"cohort_key": 1475280000000,
"last_name": "Lee",
"file_identifier": "9502",
"file_status": "open",
"responsible_label": "Dr John Doe",
"status": "probably_done",
"earliest_appointment_date_key": 1478736000000,
"latest_appointment_date_key_as_of_today": 1478736000000,
"months_between_first_last_apppointment": 0,
"count_appointments": 1,
"count_hours": 0.5,
"procedures": [],
"avg_charges_gross_amount_per_appointment": 0.0,
"avg_charges_gross_amount_per_hour": 0.0,
"charges_gross_amount_total": 0.0,
"charges_amount_laboratory": 0.0
}
]
I would like to change the format of this table so that procedures
is expanded into a pivot table with:
index=[all other fields]
columns=procedure[procedure_key]
values=[procedure[unit_count], procedure[charges_amount_laboratory]]
So far my efforts have been mostly fruitless - I reached the point where I have the index and the columns but still have a dict as the value of the columns:
[
{
"cohort_key": 1467331200000,
"last_name": "Johnson",
"file_identifier": "1227",
"file_status": "open",
"responsible_label": "Dr John Doe",
"status": "probably_done",
"earliest_appointment_date_key": 1473206400000,
"latest_appointment_date_key_as_of_today": 1524009600000,
"months_between_first_last_apppointment": 19,
"count_appointments": 17,
"count_hours": 17.5,
"avg_charges_gross_amount_per_appointment": 397.06,
"avg_charges_gross_amount_per_hour": 385.71,
"charges_gross_amount_total": 6750.0,
"charges_amount_laboratory": 0.0,
"15109": null,
"15110": null,
"80000": { "unit_count": 1, "charges_amount_laboratory": 0.0 },
"80630": null,
"80650": null,
"80671": null,
"80672": null,
"81130": null,
"81207": null,
"81222": null,
"83113": null,
"83114": null,
"83200": null,
"83210": null,
"84000": { "unit_count": 1, "charges_amount_laboratory": 0.0 },
"84500": null,
"89600": { "unit_count": 18, "charges_amount_laboratory": 0.0 }
}
]
which was obtained through the following code
from collections import defaultdict

df_patients = df.copy()

def agg_procedures(procedures):
    """Return a single dict keyed by procedure_key, summing unit_count and charges_amount_laboratory."""
    result = defaultdict(dict)
    for procedure in procedures:
        proc_dict = result[procedure['procedure_key']]
        proc_dict['unit_count'] = procedure['unit_count'] + proc_dict.get('unit_count', 0)
        proc_dict['charges_amount_laboratory'] = procedure['charges_amount_laboratory'] + proc_dict.get('charges_amount_laboratory', 0)
    return result
df_procedures = df_patients[['cohort_key', 'last_name', 'file_identifier', 'procedures']].copy()
df_normalized_procedures = (
df_procedures["procedures"]
.apply(agg_procedures)
.apply(pd.Series) # see https://stackoverflow.com/questions/38231591/split-explode-a-column-of-dictionaries-into-separate-columns-with-pandas
)
df_normalized_procedures = df_normalized_procedures.reindex(sorted(df_normalized_procedures.columns), axis=1)
df_patients = df_patients.drop("procedures", axis=1)
df_concat = pd.concat([df_patients, df_normalized_procedures], axis=1)
df_concat = df_concat.set_index([col for col in df_patients])
but in the end this method feels like a dead end: I don't know how to turn the dict into a column sublevel. I figured my approach may be wrong.
How would you go about doing this transformation?
CodePudding user response:
What you can do is unnest the dictionaries with a double list comprehension:
unnest = pd.DataFrame([_d for _l in df['procedures'] for _d in _l])
print(unnest.head())
# procedure_key unit_count charges_amount_laboratory
# 0 84000 1 0.0
# 1 89600 1 0.0
# 2 89600 1 0.0
# 3 89600 1 0.0
# 4 89600 1 0.0
Note that you could use explode
and tolist
or maybe stack
to get the same result, but that seems longer and makes it less obvious that you are unnesting data.
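For reference, the explode route mentioned above can be sketched like this on a tiny hypothetical frame shaped like the question's data (the two rows here are made up for illustration):

```python
import pandas as pd

# Two rows shaped like the original data: a list of procedure dicts per file.
df = pd.DataFrame({
    "file_identifier": ["5672", "6690"],
    "procedures": [
        [{"procedure_key": "84000", "unit_count": 1, "charges_amount_laboratory": 0.0},
         {"procedure_key": "89600", "unit_count": 1, "charges_amount_laboratory": 0.0}],
        [{"procedure_key": "83200", "unit_count": 1, "charges_amount_laboratory": 83.0}],
    ],
})

# explode turns each list element into its own row while keeping the original
# index label; the dicts are then expanded into columns with tolist.
exploded = df["procedures"].explode().dropna()
unnest = pd.DataFrame(exploded.tolist(), index=exploded.index)
print(unnest)
```

Note how the original index labels survive (0, 0, 1 here), which is what later lets you map each procedure row back to its patient row.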
Next, add a column with the associated original index; for this you can repeat
the original index by the length of the list in each row.
unnest['orig_idx'] = df.index.repeat(df['procedures'].str.len())
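As a tiny illustration (with made-up data) of how Index.repeat lines up index labels with list lengths:

```python
import pandas as pd

# Each index label is repeated as many times as the corresponding list is long,
# so a row with an empty list contributes no labels at all.
idx = pd.RangeIndex(3)
lengths = pd.Series([[1, 2], [], [3]]).str.len()  # -> 2, 0, 1
print(idx.repeat(lengths).tolist())  # [0, 0, 2]
```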
Finally, you can pivot_table
pv_df = unnest.pivot_table(
index='orig_idx',columns='procedure_key',
values=['charges_amount_laboratory','unit_count'], fill_value=0)
print(pv_df)
# charges_amount_laboratory unit_count \
# procedure_key 80000 80672 83200 84000 89600 80000
# orig_idx
# 0 0 0 0 0 0 1
# 1 0 0 83 0 0 1
# procedure_key 80672 83200 84000 89600
# orig_idx
# 0 0 0 1 1
# 1 1 1 1 0
Now you need to reindex
with the original index (to recover rows whose procedures list was empty), then set_index
with all the columns of your original dataframe except the procedures column.
res = (
pv_df.reindex(df.index, fill_value=0)
.set_index(pd.MultiIndex.from_frame(df[[col for col in df
if not col.startswith("procedure")]]))
)
print(res)
Note: unnesting in SQL seems easier.
CodePudding user response:
After trying to deal with this monstrosity in pandas, I decided to simply
UNNEST(procedures)
in the returning SQL query and then perform a pivot_table directly on the result:
df.pivot_table(
index=[col for col in df if not col.startswith("procedure.")],
columns="procedure_key",
values=["procedure.unit_count", "procedure.charges_amount_laboratory"],
aggfunc="sum"
)
I am still interested in a clean, pandas-only solution to this, because the UNNEST itself blurs the level of aggregation in the query and is not as explicit as I would like it to be.
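For what it's worth, a pandas-only sketch combining the ideas from the first answer (explode to flatten, pivot_table, then reindex and set_index) could look like this. The frame below is a small hypothetical example with the same shape as the real data; the variable names are assumptions, not the original code:

```python
import pandas as pd

# Hypothetical data: one file with two procedures, one file with none.
df = pd.DataFrame({
    "cohort_key": [1475280000000, 1475280000000],
    "file_identifier": ["6690", "6568"],
    "procedures": [
        [{"procedure_key": "80000", "unit_count": 1, "charges_amount_laboratory": 0.0},
         {"procedure_key": "83200", "unit_count": 1, "charges_amount_laboratory": 83.0}],
        [],  # file with no procedures
    ],
})

id_cols = [c for c in df.columns if c != "procedures"]

# Flatten the list-of-dicts column: one row per procedure, keeping the row index.
flat = df["procedures"].explode().dropna()
flat = pd.DataFrame(flat.tolist(), index=flat.index)

# Pivot on procedure_key, reattach rows that had no procedures via reindex,
# then restore the identifying columns as the index.
pivoted = (
    flat.pivot_table(index=flat.index, columns="procedure_key",
                     values=["unit_count", "charges_amount_laboratory"],
                     aggfunc="sum", fill_value=0)
    .reindex(df.index, fill_value=0)
    .set_index(pd.MultiIndex.from_frame(df[id_cols]))
)
print(pivoted)
```

The result has a two-level column index (value name over procedure_key), and the file with an empty procedures list comes back as an all-zero row rather than being dropped.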