Explode a pandas nested array in Python


I am reading data from MongoDB and dropping it in S3, then reading it back with Athena.

This is my collection; it contains an items column, which is an array. How do I explode it into separate columns when saving to S3?

{"_id":{"$oid":"11111111"},
"receiptId":"rtrtrtrttrtrtrtr",
"paymentSystem":"CARD",
"lastFourDigit":"1111",
"cardType":"ghsl",
"paidOn":{"$numberLong":"1623078706000"},
"currency":"USD",
"totalAmountInCents":{"$numberInt":"0000"},
"items":[{"title":"Jun 21 - Jun 21,2022",
"description":"Starter",
"currency":"USD",
"amountInCents":{"$numberInt":"0000"},
"itemType":"SUBSCRIPTION_PLAN",
"id":{"$numberInt":"1"},
"frequency":"YEAR",
"periodStart":{"$numberLong":"1624288306000"},
"periodEnd":{"$numberLong":"1655824306000"}}],
"subscriptionPlanTitle":"Starter",
"subscriptionPlanFrequency":"YEAR",
"uuid":"1111111111",
"createTimestamp":{"$numberLong":"1624292188650"},
"updateTimestamp":{"$numberLong":"1624292188650"}}

The Python code I tried:

import pandas as pd

# pull the matching documents into a list, then flatten to a DataFrame
mylist = []
myresult = collection.find(query)
for x in myresult:
    mylist.append(x)
df = pd.json_normalize(mylist)
df1 = df.applymap(str)  # stringify everything before writing to Parquet

I am able to save that to Parquet, but the items all end up in a single column. Is there a way to explode them dynamically?

The output schema might be:

_id                          object
id                           object
createTimestamp              object
updateTimestamp              object
deleteTimestamp              object
receiptId                    object
paymentSystem                object
lastFourDigit                object
cardType                     object
paidOn                       object
currency                     object
totalAmountInCents           object
items.title                  object
items.description            object
items.currency               object
items.amountInCents          object
items.itemType               object
items.id                     object
items.frequency              object
items.periodStart            object
items.periodEnd              object
subscriptionPlanTitle        object
subscriptionPlanFrequency    object
uuid                         object
consumerEmail                object
taxAmountInCents             object
gifted                       object

CodePudding user response:

You could use pd.json_normalize with record_path and meta:

out = pd.json_normalize(data, record_path=['items'],
                        meta=list(data.keys() - {'items'}),
                        record_prefix='items.')
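
For a cursor that returns many documents, you can normalize the whole list in one call; a minimal sketch, assuming collection and query from the question and taking the meta keys from the first document (errors='ignore' skips keys that are missing from some documents):

import pandas as pd

docs = list(collection.find(query))

# every top-level key except "items" becomes a meta column,
# repeated alongside each exploded item row
meta_keys = [k for k in docs[0] if k != 'items']

out = pd.json_normalize(docs, record_path=['items'], meta=meta_keys,
                        record_prefix='items.', errors='ignore')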

Another option is to build a DataFrame from data, explode the "items" column into its own DataFrame, and join the two:

df = pd.json_normalize(data)
out1 = (df.join(df['items'].explode()
                           .pipe(lambda x: pd.DataFrame(x.tolist()))
                           .add_prefix('items.'))
          .drop(columns='items'))

Output:

            items.title items.description items.currency     items.itemType  \
0  Jun 21 - Jun 21,2022           Starter            USD  SUBSCRIPTION_PLAN   

  items.frequency items.amountInCents.$numberInt items.id.$numberInt  \
0            YEAR                           0000                   1   

  items.periodStart.$numberLong items.periodEnd.$numberLong cardType  ...  \
0                 1624288306000               1655824306000     ghsl  ...   

         uuid lastFourDigit                   _id currency  \
0  1111111111          1111  {'$oid': '11111111'}      USD   

       totalAmountInCents                   createTimestamp  \
0  {'$numberInt': '0000'}  {'$numberLong': '1624292188650'}   

                             paidOn                   updateTimestamp  \
0  {'$numberLong': '1623078706000'}  {'$numberLong': '1624292188650'}   

  subscriptionPlanTitle paymentSystem  
0               Starter          CARD  

[1 rows x 22 columns]

Note that some of the keys in the expected schema (e.g. "taxAmountInCents") don't exist in the sample document, so they won't appear in the output.
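
One more caveat with the join approach above: pd.DataFrame(x.tolist()) resets the index, so the alignment with df only holds while each document carries a single item, as in the sample. A sketch of a variant that keeps the original index, and so survives multi-item documents:

import pandas as pd

df = pd.json_normalize(data)

# explode() repeats the parent row's index for each item, so rebuilding
# the items DataFrame with that index keeps rows aligned on the join
exploded = df['items'].explode()
items = pd.DataFrame(exploded.tolist(), index=exploded.index).add_prefix('items.')
out1 = df.drop(columns='items').join(items)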
