The page contains a paragraph like this:
The latest financial statements filed by x in the business register correspond to the year 2020 and show a turnover range of 'Between 6,000,000 and 30,000,000 Euros'.
On the page itself, the original Italian reads: L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese corrisponde all’anno 2020 e riporta un range di fatturato di 'Tra 6.000.000 e 30.000.000 Euro'.
I need to scrape the value inside the single quotes (in this case, Between 6,000,000 and 30,000,000 Euros) and put it in a column called "range".
I tried this code, without success:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.informazione-aziende.it/Azienda_EURO-PA-SRL'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
turnover = soup.find("span", {"id": "turnover"}).text
year = soup.find("span", {"id": "year"}).text
data = {'turnover': turnover, 'year': year}
df = pd.DataFrame(data, index=[0])
print(df)
But I get: AttributeError: 'NoneType' object has no attribute 'text'
CodePudding user response:
First, scrape the whole text with BeautifulSoup, and assign it to a variable such as:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
Then, execute the following code:
import re
# Match a single quote, any characters (non-greedy), then the closing quote
pattern = "'.*?'"
result = re.search(pattern, text)
result = result[0].replace("'", "")
After this, result will hold:
'Between 6,000,000 and 30,000,000 Euros'
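A variant of the same idea, as a minimal sketch: a capture group returns the text between the quotes directly, so no replace step is needed, and the no-match case is checked explicitly because re.search returns None when nothing matches. The helper name extract_quoted is hypothetical, not part of any library.

```python
import re


def extract_quoted(text):
    """Return the first single-quoted substring of text, or None if there is none."""
    # Non-greedy capture group: stops at the first closing quote.
    match = re.search(r"'(.*?)'", text)
    return match.group(1) if match else None


text = (
    "The latest financial statements filed by x in the business register "
    "it corresponds to the year 2020 and shows a turnover range of "
    "'Between 6,000,000 and 30,000,000 Euros'."
)

print(extract_quoted(text))
```

Checking for None before using the match avoids the same kind of 'NoneType' error the question ran into.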
CodePudding user response:
An alternative is to split the text on the single quote character (') and take the element at index 1 of the resulting list.
Code:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
# Get the text at position 1 of the list:
desired_text = text.split("'")[1]
# Print the result:
print(desired_text)
Result:
Between 6,000,000 and 30,000,000 Euros
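To get from the extracted text to the "range" column the question asks for, something like the following might work. It is a sketch under assumptions: the html string is a hypothetical stand-in for resp.text (the real page markup may differ), and the pattern anchors on the Italian phrase "range di fatturato di" instead of on the first quote, in case the page text contains other straight apostrophes before the value.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for resp.text; the real page markup may differ.
html = """
<div>
  <p>L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese
  corrisponde all’anno 2020 e riporta un range di fatturato di
  'Tra 6.000.000 e 30.000.000 Euro'.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Work on the plain text of the whole page, so no specific tag or id is required.
page_text = soup.get_text()

# Anchor on the phrase that precedes the value, then capture what sits between the quotes.
match = re.search(r"range di fatturato di\s*'([^']*)'", page_text)
turnover_range = match.group(1) if match else None

# One-row DataFrame with the value in a column called "range".
df = pd.DataFrame({"range": [turnover_range]})
print(df)
```

If match is None on the real page, the phrase was not in the HTML that requests received, which can happen when the content is rendered by JavaScript or the request is blocked. That is worth checking before changing the pattern.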