How to scrape values from a specific paragraph based on a pattern

The page contains a paragraph like this (translated):

The latest financial statements filed by x in the business register correspond to the year 2020 and show a turnover range of 'Between 6,000,000 and 30,000,000 Euros'.

On the page itself, the text is in Italian: L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese corrisponde all’anno 2020 e riporta un range di fatturato di 'Tra 6.000.000 e 30.000.000 Euro'.

I need to scrape the value between the single quotes (in this case, Between 6,000,000 and 30,000,000 Euros) and put it in a column called "range".

I tried this code, without success:

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.informazione-aziende.it/Azienda_EURO-PA-SRL'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

turnover = soup.find("span", {"id": "turnover"}).text
year = soup.find("span", {"id": "year"}).text

data = {'turnover': turnover, 'year': year}
df = pd.DataFrame(data, index=[0])
print(df)

But I get: AttributeError: 'NoneType' object has no attribute 'text'

CodePudding user response：

First, scrape the whole text with BeautifulSoup, and assign it to a variable such as:

text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."

Then, execute the following code:

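import re

# Match everything from the first single quote to the last one, quotes included.
pattern = r"'.*'"
result = re.search(pattern, text)
# result[0] is the whole match; strip the surrounding quotes from it.
result = result[0].replace("'", "")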

The value of result will be:

'Between 6,000,000 and 30,000,000 Euros'
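As a hedged variation on the same idea, a capture group can return the quoted value directly, and a second group can pick up the year. The sketch below runs against the Italian sentence quoted in the question; anchoring on the word "anno" and the names year and turnover_range are assumptions made for illustration, not something the page guarantees.

```python
import re

# Italian sentence quoted in the question. "L’ultimo" and "all’anno" use curly
# apostrophes, so the only straight quotes are the ones around the range.
text = (
    "L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese "
    "corrisponde all’anno 2020 e riporta un range di fatturato di "
    "'Tra 6.000.000 e 30.000.000 Euro'."
)

# Assumed anchors: a four-digit year after "anno", then the first span
# enclosed in straight single quotes.
pattern = re.compile(r"anno\s+(?P<year>\d{4}).*?'(?P<range>[^']*)'")

match = pattern.search(text)
if match is None:
    # The sentence was not found; avoid raising on a missing match.
    year, turnover_range = None, None
else:
    year = match.group("year")
    turnover_range = match.group("range")

print(year, turnover_range)
```

Using [^']* instead of .* stops the match at the next quote, so a later quoted phrase in the same text would not be swallowed into the result.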

CodePudding user response：

An alternative can be:

Split the text by the single quote character - ' - and get the text at position 1 of the list.

Code:

text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."

# Get the text at position 1 of the list: 
desired_text = text.split("'")[1]

# Print the result: 
print(desired_text)

Result:

Between 6,000,000 and 30,000,000 Euros
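To tie this back to the question: soup.find returns None when no element matches, which is what the AttributeError indicates, so the ids "turnover" and "year" are probably not present in the HTML that requests receives. A minimal sketch of the whole flow follows, assuming the sentence sits in a single text node; the inline html string is a hypothetical stand-in for resp.text, since the real markup is not shown in the question.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for resp.text; the real page structure is an assumption.
html = """
<div>
  <p>Some unrelated paragraph.</p>
  <p>L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese
  corrisponde all’anno 2020 e riporta un range di fatturato di
  'Tra 6.000.000 e 30.000.000 Euro'.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the text node by its content instead of guessing a tag id.
sentence = soup.find(string=re.compile("range di fatturato"))

turnover_range = None
if sentence is not None:
    parts = str(sentence).split("'")
    # A quoted value yields at least three parts: before, inside, after.
    if len(parts) >= 3:
        turnover_range = parts[1]

# One-row DataFrame with the column name requested in the question.
df = pd.DataFrame({"range": [turnover_range]})
print(df)
```

The length check keeps split from raising IndexError when the sentence has no quoted part. If the site splits the sentence across nested tags, searching soup.get_text() with a regular expression may be the more robust choice.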