I am reading data from MongoDB and dropping it into S3, then querying it with Athena.
This is my collection, which contains an "items" column that is an array. How can I explode that into separate columns when saving to S3?
{"_id":{"$oid":"11111111"},
"receiptId":"rtrtrtrttrtrtrtr",
"paymentSystem":"CARD",
"lastFourDigit":"1111",
"cardType":"ghsl",
"paidOn":{"$numberLong":"1623078706000"},
"currency":"USD",
"totalAmountInCents":{"$numberInt":"0000"},
"items":[{"title":"Jun 21 - Jun 21,2022",
"description":"Starter",
"currency":"USD",
"amountInCents":{"$numberInt":"0000"},
"itemType":"SUBSCRIPTION_PLAN",
"id":{"$numberInt":"1"},
"frequency":"YEAR",
"periodStart":{"$numberLong":"1624288306000"},
"periodEnd":{"$numberLong":"1655824306000"}}],
"subscriptionPlanTitle":"Starter",
"subscriptionPlanFrequency":"YEAR",
"uuid":"1111111111",
"createTimestamp":{"$numberLong":"1624292188650"},
"updateTimestamp":{"$numberLong":"1624292188650"}}
Python code I tried:
import pandas as pd

mylist = []
myresult = collection.find(query)
for x in myresult:
    mylist.append(x)
df = pd.json_normalize(mylist)
df1 = df.applymap(str)
I am able to save this as Parquet, but all the items end up in a single column. Is there a way to explode them into separate columns dynamically?
The desired output schema is:
_id object
id object
createTimestamp object
updateTimestamp object
deleteTimestamp object
receiptId object
paymentSystem object
lastFourDigit object
cardType object
paidOn object
currency object
totalAmountInCents object
items.title object
items.description object
items.currency object
items.amountInCents object
items.itemType object
items.id object
items.frequency object
items.periodStart object
items.periodEnd object
subscriptionPlanTitle object
subscriptionPlanFrequency object
uuid object
consumerEmail object
taxAmountInCents object
gifted object
CodePudding user response:
You could use json_normalize:
out = pd.json_normalize(data, ['items'], list(data.keys() - {'items'}), record_prefix = 'items.')
Another option is to create a DataFrame from data, then explode the "items" column, build a separate DataFrame from it, and join:
df = pd.json_normalize(data)
out1 = df.join(df['items'].explode().pipe(lambda x: pd.DataFrame(x.tolist())).add_prefix('items.')).drop(columns='items')
Output:
items.title items.description items.currency items.itemType \
0 Jun 21 - Jun 21,2022 Starter USD SUBSCRIPTION_PLAN
items.frequency items.amountInCents.$numberInt items.id.$numberInt \
0 YEAR 0000 1
items.periodStart.$numberLong items.periodEnd.$numberLong cardType ... \
0 1624288306000 1655824306000 ghsl ...
uuid lastFourDigit _id currency \
0 1111111111 1111 {'$oid': '11111111'} USD
totalAmountInCents createTimestamp \
0 {'$numberInt': '0000'} {'$numberLong': '1624292188650'}
paidOn updateTimestamp \
0 {'$numberLong': '1623078706000'} {'$numberLong': '1624292188650'}
subscriptionPlanTitle paymentSystem
0 Starter CARD
[1 rows x 22 columns]
Note that some of the keys in the desired output schema (e.g. "taxAmountInCents") don't exist in the sample document.
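Since the question reads multiple documents from a cursor, here is a minimal sketch of applying the first option per document and concatenating the results; the docs list below is a hypothetical stand-in for list(collection.find(query)), trimmed to a few fields:

```python
import pandas as pd

# Hypothetical stand-in for `docs = list(collection.find(query))`,
# trimmed to a few fields from the sample document.
docs = [
    {
        "_id": {"$oid": "11111111"},
        "receiptId": "rtrtrtrttrtrtrtr",
        "currency": "USD",
        "items": [
            {
                "title": "Jun 21 - Jun 21,2022",
                "description": "Starter",
                "id": {"$numberInt": "1"},
            }
        ],
    },
]

# Flatten each document's "items" array into rows, carrying the
# remaining top-level fields along as metadata columns.
frames = [
    pd.json_normalize(
        d,
        record_path=["items"],
        meta=list(d.keys() - {"items"}),
        record_prefix="items.",
    )
    for d in docs
]
out = pd.concat(frames, ignore_index=True)

# The extended-JSON wrappers ({"$numberInt": ...}) inside each item
# survive as nested ".$numberInt"/".$numberLong" columns; strip those
# suffixes if Athena should see plain names like "items.id".
out.columns = [
    c.replace(".$numberInt", "").replace(".$numberLong", "")
    for c in out.columns
]
```

The values stay as the raw strings from the extended-JSON wrappers; cast them (e.g. with astype) before writing Parquet if Athena should see numeric types.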