I have been trying to flatten column sellers.price_data for a long time with no luck. sellers.price_data is the last column in the table output below. In order to get to this output, I have had to normalize and explode the data provided in the api link in my code below.
sku product_name map_price sellers.seller_name sellers.website_url sellers.price_data
0 P350022-141 P350022-141 1-60W MED FLUSH MOUNT - Galvanized... 47.00 amazon https://www.amazon.com/dp/B08NC8LKFY [{'date': '2021-10-12', 'price': '47.99'}]
1 P350022-141 P350022-141 1-60W MED FLUSH MOUNT - Galvanized... 47.00 homedepot https://www.google.com/aclk?sa=L&ai=DChcSEwioy... [{'date': '2021-10-12', 'price': '59.99'}]
2 P350022-141 P350022-141 1-60W MED FLUSH MOUNT - Galvanized... 47.00 lowes https://www.google.com/aclk?sa=L&ai=DChcSEwioy... [{'date': '2021-10-12', 'price': '64.76'}]
3 P350022-141 P350022-141 1-60W MED FLUSH MOUNT - Galvanized... 47.00 overstock https://www.overstock.com/32533008/product.htm... [{'date': '2021-10-12', 'price': '59.99'}]
4 P350022-141 P350022-141 1-60W MED FLUSH MOUNT - Galvanized... 47.00 canadalightingexperts https://www.canadalightingexperts.com/lighting... [{'date': '2021-10-12', 'price': '47.99'}]
Code to produce output:
import pandas as pd
import requests
import json
from tabulate import tabulate
response = 'https://mschannellogin.com/Api/produc_matrix_tier_1?api_key=x'
requests.get(response).json()
json_text = requests.get(response).json()
# ESTABLISH WHAT ARE THE OUTERMOST KEYS
for i in json_text:
print(i)
# STATUS, MESSAGE, DATA ARE THE 3 OUTERMOST KEYS
# GO INSIDE OF 'data'
parse_json_text=[]
for i in json_text['data']:
parse_json_text.append(i)
pd.options.display.width=None
print(tabulate(pd.DataFrame(parse_json_text).head()))
# DEAL WITH COLUMNS THAT ARE STILL LISTS
data_df=pd.DataFrame(json_text['data'])
print(data_df.head(4))
print(tabulate(data_df.loc[:,['sku', 'product_name', 'map_price', 'sellers']]))
mod_df=data_df.loc[:,['sku', 'product_name', 'map_price', 'sellers']]
for i in mod_df:
mod_df=mod_df.explode(i)
print(tabulate(mod_df.head(9)))
flat_dct=pd.json_normalize(json.loads(mod_df.\
to_json(orient="records")))
flat_dct = pd.DataFrame(flat_dct)
print(flat_dct.head())
The 'data' key only recognizes sku, product name, map price, and sellers. I do not know how to re-loop through the price_data column.
My eventual goal is to bring this into PowerBI through a python script. It works the last issue is column sellers.price_data.
Any suggestions, I have been working on this for a long time with no luck, thank you.
CodePudding user response:
IIUC, you want:
mod_df = data_df.explode("sellers").reset_index(drop=True)
flat_df = mod_df.drop("sellers", axis=1)
.join(pd.json_normalize(mod_df["sellers"],
record_path="price_data",
meta=["seller_name", "website_url"]
)
)
CodePudding user response:
You can use json_normalize
and specify the meta
parameters.
pd.json_normalize(
json_text["data"],
record_path=["sellers", "price_data"],
meta=[
"sku",
"product_name",
"map_price",
["sellers", "seller_name"],
["sellers", "website_url"],
],
)
These are the columns produced:
['date',
'price',
'sku',
'product_name',
'map_price',
'sellers.seller_name',
'sellers.website_url']