json_normalize unable to flatten column which is a List of Dictionary-CodePudding

I have been trying to flatten column sellers.price_data for a long time with no luck. sellers.price_data is the last column in the table output below. In order to get to this output, I have had to normalize and explode the data provided in the api link in my code below.

           sku                                       product_name map_price    sellers.seller_name                                sellers.website_url                          sellers.price_data
0  P350022-141  P350022-141 1-60W MED FLUSH MOUNT - Galvanized...     47.00                 amazon               https://www.amazon.com/dp/B08NC8LKFY  [{'date': '2021-10-12', 'price': '47.99'}]
1  P350022-141  P350022-141 1-60W MED FLUSH MOUNT - Galvanized...     47.00              homedepot  https://www.google.com/aclk?sa=L&ai=DChcSEwioy...  [{'date': '2021-10-12', 'price': '59.99'}]
2  P350022-141  P350022-141 1-60W MED FLUSH MOUNT - Galvanized...     47.00                  lowes  https://www.google.com/aclk?sa=L&ai=DChcSEwioy...  [{'date': '2021-10-12', 'price': '64.76'}]
3  P350022-141  P350022-141 1-60W MED FLUSH MOUNT - Galvanized...     47.00              overstock  https://www.overstock.com/32533008/product.htm...  [{'date': '2021-10-12', 'price': '59.99'}]
4  P350022-141  P350022-141 1-60W MED FLUSH MOUNT - Galvanized...     47.00  canadalightingexperts  https://www.canadalightingexperts.com/lighting...  [{'date': '2021-10-12', 'price': '47.99'}]

Code to produce output:

import pandas as pd
import requests
import json
from tabulate import tabulate


response = 'https://mschannellogin.com/Api/produc_matrix_tier_1?api_key=x'
requests.get(response).json()
json_text = requests.get(response).json()

# ESTABLISH WHAT ARE THE OUTERMOST KEYS
for i in json_text:
    print(i)
# STATUS, MESSAGE, DATA ARE THE 3 OUTERMOST KEYS

# GO INSIDE OF 'data'
parse_json_text=[]
for i in json_text['data']:
    parse_json_text.append(i)

pd.options.display.width=None
print(tabulate(pd.DataFrame(parse_json_text).head()))



# DEAL WITH COLUMNS THAT ARE STILL LISTS
data_df=pd.DataFrame(json_text['data'])
print(data_df.head(4))

print(tabulate(data_df.loc[:,['sku', 'product_name', 'map_price', 'sellers']]))

mod_df=data_df.loc[:,['sku', 'product_name', 'map_price', 'sellers']]
for i in mod_df:
    mod_df=mod_df.explode(i)
print(tabulate(mod_df.head(9)))


flat_dct=pd.json_normalize(json.loads(mod_df.\
                                      to_json(orient="records")))

flat_dct = pd.DataFrame(flat_dct)
print(flat_dct.head())

The 'data' key only recognizes sku, product name, map price, and sellers. I do not know how to re-loop through the price_data column.

My eventual goal is to bring this into PowerBI through a python script. It works the last issue is column sellers.price_data.

Any suggestions, I have been working on this for a long time with no luck, thank you.

CodePudding user response：

IIUC, you want:

mod_df = data_df.explode("sellers").reset_index(drop=True)
flat_df = mod_df.drop("sellers", axis=1)
                .join(pd.json_normalize(mod_df["sellers"], 
                                        record_path="price_data", 
                                        meta=["seller_name", "website_url"]
                                        )
                      )

CodePudding user response：

You can use json_normalize and specify the meta parameters.

pd.json_normalize(
    json_text["data"],
    record_path=["sellers", "price_data"],
    meta=[
        "sku",
        "product_name",
        "map_price",
        ["sellers", "seller_name"],
        ["sellers", "website_url"],
    ],
)

These are the columns produced:

['date',
 'price',
 'sku',
 'product_name',
 'map_price',
 'sellers.seller_name',
 'sellers.website_url']