Home > Net >  Read XML using pandas parsing to csv
Read XML using pandas parsing to csv

Time:05-09

I have the following code to extract data from XML to CSV file, but there is an error and I don't know how to solve it.

if anyone can help, please.

url = "http://90.161.233.78:65519/services/user/records.xml?begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?var=EDSLINEEMBEDDED.Module2.API1?period=900"

s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//record/* | //dateTime")
df["field"] = df["field"].ffill()
df.to_csv('output0.csv')

The Error is

  doc = fromstring(
  File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1800, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 2

CodePudding user response:

Consider reading URL without requests or escaping content directly into pandas.read_xml(). Per docs, emphasis added:

path_or_buffer: str, path object, or file-like object

String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.

import pandas as pd

url = (
    "http://90.161.233.78:65519/services/user/records.xml?"
    "begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?"
    "var=EDSLINEEMBEDDED.Module2.API1?period=900"
)

df = pd.read_xml(url, xpath="//record/* | //dateTime")

# FILL PARENT TEXT FORWARD TO CHILD ITEMS
df["dateTime"] = df["dateTime"].ffill()

# DROP UNNEEDED ROWS
df[(pd.notnull(df["id"])) & (pd.notnull(df["value"]))]

df.to_csv('output0.csv')
  • Related