Parsing xml files to csv file using beautifulsoup

Time:01-03

I am trying to parse multiple (eventually over 1000) xml files to extract three pieces of information: persName, @ref and the /date. I can read all the files, and when I use print() it shows all the information I want. However when I try to write that information to a csv file only the last xml file is parsed.

from bs4 import BeautifulSoup
import csv
import os
path = r'C:\programming1\my-app'

for filename in os.listdir(path):
    if filename.endswith(".xml"):
        fullpath = os.path.join(path, filename)

        f = csv.writer(open("test2.csv", "w"))
        f.writerow(["date", "Name", "pref"])

        soup = BeautifulSoup (open(fullpath, encoding="utf-8"), "lxml")
        # removing unnecessary information to better isolate //date
        for docs in soup.find_all('tei'):
            for pubstmt in soup.find_all("publicationStmt"): 
                pubstmt.decompose()
            for sourdesc in soup.find_all("sourceDesc"):
                sourdesc.decompose()
            for lists in soup.find_all("list"):
                lists.decompose()
            for heads in soup.find_all("head"):
                heads.decompose()
            #finding all dates of Protokolls under /title
            for dates in soup.find_all("date"):
                date = dates.get('when')

            #getting all Names from xml files except for those in /list
            for Names in soup.find_all("persname"):
                nameonly = Names.contents
                nameref = Names.get("ref")
                f.writerow([date, nameonly, nameref])

If I put writerow inside the for Names loop, it writes all the info, but only for the last file. If I put writerow after the for Names loop, it writes the info for only one name.

Could someone tell me what I am doing wrong? I have tried many for loops and none seem to work.

CodePudding user response:

You wrote:

However when I try to write that information to a csv file only the last xml file is parsed.

From reading your code, what's happening is:

every XML is parsed, but only the last XML file is written to the CSV

and that's because you are opening test2.csv "for writing" for every input XML. When you open for writing, "w", it creates the file, or in your case, it re-creates the file (overwriting its contents) for every iteration.
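
The overwriting can be seen in isolation with a minimal sketch. The file name demo.csv, the temporary directory and the three stand-in input names are assumptions made up for this illustration; they are not part of the question.

```python
import os
import tempfile

# Hypothetical output file in a throwaway directory (illustration only).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

for name in ["a.xml", "b.xml", "c.xml"]:
    # Mode "w" truncates the file every time it is opened,
    # so whatever the previous iteration wrote is discarded.
    with open(demo_path, "w", newline="") as fh:
        fh.write(name + "\n")

with open(demo_path) as fh:
    remaining = fh.read()

print(repr(remaining))  # 'c.xml\n'
```

Only the line written in the final iteration survives, which matches the symptom described in the question.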

Because you want a header:

  1. you need to open the CSV before you start iterating the XMLs
  2. write your header
  3. loop over your XMLs processing and writing to the CSV
  4. at the very bottom, after you've exited the loop, close the CSV
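
The four steps above could be sketched roughly as follows. This is only an outline under stated assumptions: xml_dir_to_csv and extract_rows are hypothetical names, and extract_rows stands in for the BeautifulSoup parsing from the question, which is not reproduced here.

```python
import csv
import os


def xml_dir_to_csv(xml_dir, csv_path, extract_rows):
    """Collect rows from every .xml file in xml_dir into a single CSV.

    extract_rows is a hypothetical callable: given the path of one XML
    file, it returns an iterable of (date, name, ref) rows.
    """
    # 1. Open the CSV once, before iterating over the XMLs.
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        # 2. Write the header once.
        writer.writerow(["date", "Name", "pref"])
        # 3. Loop over the XMLs, writing rows to the already open file.
        for filename in sorted(os.listdir(xml_dir)):
            if filename.endswith(".xml"):
                fullpath = os.path.join(xml_dir, filename)
                writer.writerows(extract_rows(fullpath))
    # 4. Leaving the with block closes the CSV.
```

In the question's code, everything from the BeautifulSoup(...) call down to the persname loop would move into a function passed as extract_rows, collecting [date, nameonly, nameref] rows instead of calling f.writerow directly. The with statement is one way to cover step 4 without an explicit close() call.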