How to parse data from individual records instead of the complete file


I am creating a program that uses a title and author to retrieve data. So far I have been able to get all the data using BeautifulSoup and its lxml parser, but I seem to have hit a wall, and I don't know how to overcome it.

The Situation: I am processing this data: https://sru.k10plus.de/opac-de-627!rec=1?version=1.1&operation=searchRetrieve&query=pica.tit=Geschichten aus unserer Zeit and pica.all=Hotz, Karl&maximumRecords=100&recordSchema=marcxml

and am trying to get data from the fields 245, 250, 583 and 924. They can appear multiple times in the response and contain different data. Using BeautifulSoup works fine, but the problem is that I want to separate the results by record, since some titles have a 583 entry and others do not, and I need to know which ones do, based on 245, 583 and 924. But I have no idea how to do this with BeautifulSoup.

This is what the collapsed response looks like:

<zs:searchRetrieveResponse>
<zs:version>1.1</zs:version>
<zs:numberOfRecords>14</zs:numberOfRecords>
<zs:records>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
<zs:record>    </zs:record>
</zs:records>
<zs:echoedSearchRetrieveRequest>
<zs:version>1.1</zs:version>
<zs:query>
pica.tit=Geschichten aus unserer Zeit and pica.all=Hotz, Karl
</zs:query>
<zs:maximumRecords>100</zs:maximumRecords>
<zs:recordPacking>xml</zs:recordPacking>
<zs:recordSchema>marcxml</zs:recordSchema>
</zs:echoedSearchRetrieveRequest>
</zs:searchRetrieveResponse>

and I want to iterate over every <zs:record> and check its contents. This is how I currently search the whole response:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(all_url)
    r.encoding = 'utf-8'
    if r.status_code == 200:
        is_list = []
        issues = []
        de640l = []
        soup = BeautifulSoup(r.text, 'lxml')
        for datafield in soup.find_all("datafield"):
            if datafield['tag'] == '250':  # collect all issue statements
                # search inside this datafield; find_next() would keep going
                # through the rest of the document and could grab a subfield
                # belonging to a later datafield
                subfield_a = datafield.find("subfield", {"code": "a"})
                if subfield_a is not None:
                    issue = subfield_a.text
                    if issue not in issues:
                        issues.append(issue)
                # this searches *inside* the current 250 field, so it always
                # returns an empty list -- the 924 fields are siblings, not
                # children, which is exactly the grouping problem
                datafield_924 = datafield.find_all("datafield", {"tag": "924"})

            if datafield['tag'] == "583":
                subfield_z = datafield.find("subfield", {"code": "z"})
                if subfield_z is not None:
                    de640l.append(subfield_z.text)

        for issue in issues:
            issuelist = {'issue': [issue],
                         'amount': [issues.count(issue)],
                         # always the first 583 -- there is no per-record link
                         'de640': [de640l[0]]}
            is_list.append(issuelist)

I have thought about switching to an XML API like xml.etree.ElementTree, but I don't have much experience with it.

How would I do that?

CodePudding user response:

You can pass a list of tags you want to target.

Is this what you're looking for?

import requests
from bs4 import BeautifulSoup

xml_url = "https://sru.k10plus.de/opac-de-627!rec=1?version=1.1&operation=searchRetrieve&query=pica.tit=Geschichten aus unserer Zeit and pica.all=Hotz, Karl&maximumRecords=100&recordSchema=marcxml"

tags_to_check = ["245", "250", "583", "924"]
# find_all() accepts a list as an attribute value, so this matches every
# datafield whose tag attribute is one of the four
data = (
    BeautifulSoup(requests.get(xml_url).text, features="xml")
    .find_all("datafield", tag=tags_to_check)
)
print("\n".join([field.find(code="a").getText() for field in data]))

Sample output:

Geschichten aus unserer Zeit
4. Aufl.
177646861
Apfel auf silberner Schale
2. Aufl., [Nachdr.]
Archivierung prüfen
177647086
3105051929
Wie war der Himmel blau
1. Aufl., 1. Dr.
Archivierung prüfen
175548927
3105050302
310505037X
Wie war der Himmel blau
1. Aufl.
3175270174
Wie war der Himmel blau
1. Aufl.
177712430
Geschichten aus unserer Zeit
3. Aufl.
Archivierung prüfen
3119615285
Apfel auf silberner Schale
2. Aufl
385860854
Geschichten aus unserer Zeit
2. Aufl
385621000
Du fährst zu oft nach Heidelberg
2. Aufl
177647965
385623240
Du fährst zu oft nach Heidelberg
1. Auflage, 1. Druck
Archivierung prüfen
3246143692
3246143781
Geschichten aus unserer Zeit
1. Aufl.
Archivierung prüfen
3124711890
Geschichten aus unserer Zeit
1. Aufl
249788438
Apfel auf silberner Schale
1. Aufl., 1. Dr.
Archivierung prüfen
1440117691
398303622
3234020019
3234020078
Geschichten aus unserer Zeit
249787954
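
The list above is flat, so a 583 entry is not tied to the record it came from, which was the original goal. One way to keep that link is the xml.etree.ElementTree route the question mentions: iterate per <zs:record> and collect the fields of each record into its own dict. Below is a minimal sketch using an inline stand-in for the SRU response; the structure and field/subfield names come from the question, but the sample data is invented and the namespace URIs should be verified against the actual k10plus response.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the SRU response (structure as in the question;
# titles and field values are invented for illustration).
SAMPLE = """<zs:searchRetrieveResponse
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:marc="http://www.loc.gov/MARC21/slim">
  <zs:records>
    <zs:record><zs:recordData><marc:record>
      <marc:datafield tag="245"><marc:subfield code="a">Title A</marc:subfield></marc:datafield>
      <marc:datafield tag="250"><marc:subfield code="a">1. Aufl.</marc:subfield></marc:datafield>
      <marc:datafield tag="583"><marc:subfield code="z">DE-640</marc:subfield></marc:datafield>
    </marc:record></zs:recordData></zs:record>
    <zs:record><zs:recordData><marc:record>
      <marc:datafield tag="245"><marc:subfield code="a">Title B</marc:subfield></marc:datafield>
      <marc:datafield tag="250"><marc:subfield code="a">2. Aufl.</marc:subfield></marc:datafield>
    </marc:record></zs:recordData></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

# Namespace prefixes used in the XPath-style queries below.
NS = {
    "zs": "http://www.loc.gov/zing/srw/",
    "marc": "http://www.loc.gov/MARC21/slim",
}

def parse_records(xml_text):
    """Return one dict per <zs:record>, so 583 stays attached to its record."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//zs:record", NS):
        entry = {"title": None, "issue": None, "de640": None}
        for datafield in record.iterfind(".//marc:datafield", NS):
            tag = datafield.get("tag")
            if tag == "245":
                sub = datafield.find("marc:subfield[@code='a']", NS)
                if sub is not None:
                    entry["title"] = sub.text
            elif tag == "250":
                sub = datafield.find("marc:subfield[@code='a']", NS)
                if sub is not None:
                    entry["issue"] = sub.text
            elif tag == "583":
                sub = datafield.find("marc:subfield[@code='z']", NS)
                if sub is not None:
                    entry["de640"] = sub.text
        results.append(entry)
    return results

for entry in parse_records(SAMPLE):
    print(entry)
```

With the real response, `SAMPLE` would be replaced by `requests.get(all_url).text`. Records without a 583 field simply keep `"de640": None`, which answers the "which records have one" question directly.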