How to convert ordered collections to a data frame with the proper key-value pairs in Python?

I have to get the content from a page that is in XML format into a data frame.

My code to read the file is:


from urllib.request import urlopen
import re
import pandas as pd
import xmltodict
from collections import OrderedDict

x= urlopen('https://mpr.datamart.ams.usda.gov/ws/report/v1/cattle/LM_CT138?filter={"filters":[{"fieldName":"Report date","operatorType":"EQUAL","values":["4/1/2022"]}]}').read().decode('utf-8')

data = xmltodict.parse(x)

print(data)

The output is an OrderedDict, and I have tried a number of ways to convert it to the required data frame.

For example:


keys = data.keys()
values = data.values()

print ("keys : ", str(keys))
print ("values : ", str(values))

pd.DataFrame.from_dict(values)

I am getting the entire dictionary in one column. I want the values split into different columns according to their keys.

CodePudding user response:

The data is stored in a nested dict. You are creating a dataframe whose column names are the first-level keys, so you get the first-level values in the "results" column and the whole remaining dict inside the "report" cell. You need to index into the correct level of the nested dictionary to get the correct dataframe.

import pandas as pd
import requests
import xmltodict

url = 'https://mpr.datamart.ams.usda.gov/ws/report/v1/cattle/LM_CT138?filter={"filters":[{"fieldName":"Report date","operatorType":"EQUAL","values":["4/1/2022"]}]}'
r = requests.get(url)
data = xmltodict.parse(r.text)
data_ = data['results']['report']['record']['report']['record']  # returns the list of "OrderedDict"-s with keys = column names
pd.DataFrame(data_)
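
To see why the level matters, here is a minimal sketch on a hand-written stand-in for the parsed dictionary. The nesting mirrors the indexing above, but the field names and values are made up for illustration, and the "@" prefix assumes the fields are XML attributes (xmltodict prefixes attribute keys with "@" by default).

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for the dict returned by
# xmltodict.parse(); field names and values are invented for illustration.
data = {
    "results": {
        "report": {
            "record": {
                "report": {
                    "record": [
                        {"@class_description": "STEER", "@head_count": "120"},
                        {"@class_description": "HEIFER", "@head_count": "95"},
                    ]
                }
            }
        }
    }
}

# Top level: one key ("results") whose value has one key ("report"),
# so everything else ends up inside a single cell.
top = pd.DataFrame.from_dict(data)

# Drilling down to the list of record dicts gives one row per record
# and one column per key.
records = data["results"]["report"]["record"]["report"]["record"]
df = pd.DataFrame(records)

# Optional: drop the "@" attribute prefix from the column names.
df.columns = df.columns.str.lstrip("@")
```

Here `top` has a single cell holding the nested dict, while `df` has two rows with the columns `class_description` and `head_count`.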

Sorry for using requests; I'm not really familiar with the urllib library.
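
If you would rather stay with urllib, note that requests percent-encodes the query string for you, whereas urlopen in recent Python versions may reject a URL that contains a raw space (as in "Report date"). A minimal sketch of building the same URL by hand, assuming the service accepts a percent-encoded filter; the variable names are only illustrative:

```python
import json
from urllib.parse import quote

base = "https://mpr.datamart.ams.usda.gov/ws/report/v1/cattle/LM_CT138"

# The same filter as in the question, written as a Python dict.
filters = {
    "filters": [
        {"fieldName": "Report date", "operatorType": "EQUAL", "values": ["4/1/2022"]}
    ]
}

# Serialize to compact JSON, then percent-encode every reserved character
# (safe="" also encodes the slashes inside the date).
query = quote(json.dumps(filters, separators=(",", ":")), safe="")
url = f"{base}?filter={query}"

# The request itself is left commented out so the sketch runs offline:
# from urllib.request import urlopen
# xml_text = urlopen(url).read().decode("utf-8")
```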

CodePudding user response:

Try this:

import pandas as pd
import requests
import xml.etree.ElementTree as ET

r = requests.get('https://mpr.datamart.ams.usda.gov/ws/report/v1/cattle/LM_CT138?filter={"filters":[{"fieldName":"Report date","operatorType":"EQUAL","values":["4/1/2022"]}]}')

root = ET.fromstring(r.text)
records = [record.attrib for record in root.iterfind("report/record/report/record")]
df = pd.DataFrame(records)

Every column comes out as a string. You can convert columns to other data types as needed.
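
As a rough sketch of that conversion, assuming hypothetical column names and formats (the real report's fields may differ), pd.to_numeric and pd.to_datetime can be applied column by column:

```python
import pandas as pd

# Hypothetical rows shaped like the attribute dicts collected from the XML;
# the column names and values are invented for illustration.
records = [
    {"report_date": "04/01/2022", "head_count": "1,250", "avg_price": "139.50"},
    {"report_date": "04/01/2022", "head_count": "980", "avg_price": "141.25"},
]
df = pd.DataFrame(records)

# Dates: parse with an explicit format so nothing is guessed.
df["report_date"] = pd.to_datetime(df["report_date"], format="%m/%d/%Y")

# Counts: strip thousands separators first; errors="coerce" turns
# anything unparseable into NaN instead of raising.
df["head_count"] = pd.to_numeric(
    df["head_count"].str.replace(",", "", regex=False), errors="coerce"
)

# Prices: plain decimal strings convert directly.
df["avg_price"] = pd.to_numeric(df["avg_price"], errors="coerce")
```

Using errors="coerce" is a design choice: it keeps the conversion from failing on an unexpected value, at the cost of silently producing NaN, so it is worth checking for missing values afterwards.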