I'm using BeautifulSoup to parse XML and return a 5-character string. Python is interpreting some of the strings as scientific notation (e.g., 0E001, 1E001), but I need to get the string itself.
I've tried applying str(some_code) and f"{some_code}" in multiple places, but can't seem to figure out how to force Python to interpret the result as a string as opposed to scientific notation. Here are the excerpts of the relevant code:
Function definition:
def find_eccn(text:str) -> str:
"""
This function returns a string with the substring preceding the first whitespace.
"""
return text[:text.find(" ")]
Within main() function:
for i in soup.find_all(["P"]):
# Get text in line and strip extra whitespace.
line = i.get_text().strip()
# Parse the line into the string.
eccn = find_eccn(line)
Any suggestions would be greatly appreciated!
Edit: Here's an example of the text within a tag that's causing problems: '1E001 “Technology” according to the General Technology Note for the “development” or “production” of items controlled' - The function is returning 10 instead of 1E001.
Edit 2: Here is the full code:
# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup
import csv
# Define a function to find the ECCN in the CCL.
def find_eccn(text:str) -> str:
result = text[:text.find(" ")]
return result
# Define a function to find the ECCN description in the CCL.
def find_eccn_description(text):
result = text[text.find(" ") 1:]
return result
# Define a main function which generates a CSV file with CCL data.
def main():
# Determine today's date.
today = date(2022, 10, 27)
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-15.xml?subtitle=B&chapter=VII&subchapter=C&part=774"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove all of the <note> tags.
for x in soup.find_all("NOTE"):
x.decompose()
# Reduce the soup object to only the first <div9> tag.
reduction = soup.find("DIV9")
# Define CCL field headers.
fields = ["classification", "eccn", "eccn_description", "item", "item_description"]
# Define a list for CCL rows.
rows = []
# Loop through the <*insert applicable*> tags in the soup object.
for i in reduction.find_all(["FP-2"]):
# Get text in line and strip extra whitespace.
line = i.get_text().strip()
# Parse the line into the ECCN and its description.
eccn = find_eccn(line)
eccn_description = find_eccn_description(line)
# Append data to the 'rows' list.
rows.append(["classification", eccn, eccn_description, "item", "item_description"])
# Create CSV file and write USML data to it.
with open(f"ccl_{today}.csv", "w", encoding="utf-8", newline="") as csvfile:
csvwriter = csv.writer(csvfile, delimiter="|")
csvwriter.writerow(fields)
csvwriter.writerows(rows)
# Set main function to run when the file is run.
if __name__ == "__main__":
main()
CodePudding user response:
I think your code works correctly: It may be that the 1E001
is being parsed into scientific notation by the program used for looking at your .csv
file (i.e. Excel).
However, from my perspective, it appears that the output eccn
identifier is correct (unless I'm missing something). Here's a screenshot from the output generated from your posted code:
You could try inserting a '
in front of the eccn
number to ensure that it isn't treated as scientific notation in Excel.