a bytes-like object is required, not 'str' while parsing XML files

Time:01-24

I am trying to parse an XML file that looks like this. I want to extract the attributes of each kategorie element, i.e. id, parent_id, etc.:

<?xml version="1.0" encoding="UTF-8"?>
<test timestamp="20210113">
<kategorien>
    <kategorie id="1" parent_id="0">
        Sprache
    </kategorie>
</kategorien>
</test>

This is what I am trying:

fields = ['id', 'parent_id']

with open('output.csv', 'wb') as fp:
    writer = csv.writer(fp)
    writer.writerow(fields)
    tree = ET.parse('./file.xml')
    # from your example Locations is the root and Location is the first level
    for elem in tree.getroot():
        writer.writerow([(elem.get(name) or '').encode('utf-8') 
            for name in fields])

but I get this error:

in <module>
    writer.writerow(fields)
TypeError: a bytes-like object is required, not 'str'

even though I am already using encode('utf-8') in my code. How can I get rid of this error?

CodePudding user response:

EDIT 2: If you want to read the attributes of nested elements, there are two ways:

  1. You can use a nested loop:
for elem in root:
    for child in elem:
        print([(child.attrib.get(name) or 'c') for name in fields])

Output:

['1', '0']

Note that this loop visits every second-level element, so it also returns values for elements that carry id and parent_id but are not named kategorie.

  2. If you want to skip the nested loop and visit only the matching elements:
for elem in root.iter('kategorie'):
    print([(elem.attrib.get(name) or 'c') for name in fields])

Output:

['1', '0']

This method returns every element named kategorie, at any depth below the root.
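
To make the difference between the two approaches concrete, here is a minimal sketch. It assumes the XML is held in a string rather than a file; the name xml_text and the extra <other> element are made up for illustration and are not part of the question's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's structure plus one extra element
# (<other>) that is NOT a kategorie, to show how the two loops differ.
xml_text = """<test timestamp="20210113">
<kategorien>
    <kategorie id="1" parent_id="0">Sprache</kategorie>
    <other id="9" parent_id="1">x</other>
</kategorien>
</test>"""

fields = ['id', 'parent_id']
root = ET.fromstring(xml_text)

# Nested loop: visits every grandchild of the root, whatever its tag.
nested = [[child.get(name) or '' for name in fields]
          for elem in root for child in elem]

# iter('kategorie'): visits only elements with that tag, at any depth.
by_tag = [[elem.get(name) or '' for name in fields]
          for elem in root.iter('kategorie')]

print(nested)  # [['1', '0'], ['9', '1']]
print(by_tag)  # [['1', '0']]
```

The nested loop picks up the <other> element as well, while iter('kategorie') filters by tag name.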

EDIT 1: For the issue in comments:

<?xml version="1.0"?>
<kategorien>
    <kategorie id="1" parent_id="0">
        Sprache
    </kategorie>
</kategorien>

For the above xml file, the code seems to work perfectly:

fields = ['id', 'parent_id']

for elem in tree.getroot():
    print([(elem.attrib.get(name) or 'c') for name in fields])

Output:

['1', '0']

Original Answer: Looks like you are looking at the wrong location for the error. The error is actually occurring at

writer.writerow(fields)

fields is a list of str, not bytes, which is why you get the error. In Python 3 the csv module writes str, so the file has to be opened in text mode: change the mode from wb to w. Encoding the values yourself, as in

writer.writerow([x.encode('utf-8') for x in fields])

does not help. encode() only converts a str to bytes, and csv.writer converts each bytes value back to str before writing, so with wb you still get the same TypeError, and with w the file ends up containing b'id',b'parent_id'.
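
The cause can be reproduced without touching the disk. The sketch below uses io.BytesIO and io.StringIO as stand-ins for files opened with 'wb' and 'w'. That substitution is an assumption made for illustration, and the names error and text_target are not from the question.

```python
import csv
import io

fields = ['id', 'parent_id']

# Binary target (like open(..., 'wb')): csv.writer hands it a str,
# so the underlying write() call raises TypeError.
try:
    csv.writer(io.BytesIO()).writerow(fields)
    error = None
except TypeError as exc:
    error = str(exc)

# Text target (like open(..., 'w')): the same call succeeds.
text_target = io.StringIO()
csv.writer(text_target).writerow(fields)

print(error)
print(repr(text_target.getvalue()))
```

The first print shows the error message from the question; the second shows the header row as csv.writer produced it.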

CodePudding user response:

I see two problems. First, you don't need to do the encoding yourself. Open the file without the "b" binary flag and skip .encode. The file object will do the encoding for you. The error you see comes from the ['id', 'parent_id'] list holding unencoded strings, but if you don't open the file in binary mode in the first place, it's not a problem.

Second, you are iterating over the wrong element. Add a print(elem) in your loop and you'll see that it yields kategorien, not kategorie. Instead, you can use findall with a pseudo-XPath to get the elements you want.

import csv
import xml.etree.ElementTree as ET

fields = ['id', 'parent_id']

with open('output.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(fields)
    tree = ET.parse('./file.xml')
    # the root is <test>; each <kategorie> sits under <kategorien>
    for elem in tree.getroot().findall('kategorien/kategorie'):
        writer.writerow([(elem.get(name) or '')
            for name in fields])
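
For a quick check that does not need file.xml or output.csv, here is a self-contained variant. It assumes the question's XML is embedded as a string and writes the CSV to an in-memory buffer; xml_text, buffer and rows are illustrative names only.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The XML from the question, embedded as a string (assumption for testing).
xml_text = """<test timestamp="20210113">
<kategorien>
    <kategorie id="1" parent_id="0">
        Sprache
    </kategorie>
</kategorien>
</test>"""

fields = ['id', 'parent_id']
buffer = io.StringIO()  # stands in for a file opened in text mode
writer = csv.writer(buffer)
writer.writerow(fields)

root = ET.fromstring(xml_text)
# Path is relative to the root <test>: <kategorien>, then <kategorie>.
for elem in root.findall('kategorien/kategorie'):
    writer.writerow([elem.get(name) or '' for name in fields])

rows = buffer.getvalue().splitlines()
print(rows)  # ['id,parent_id', '1,0']
```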