I am scraping data from several web pages, and the data contains several dates. The code that gets me the information I want looks like this; I have only included the part that deals with the dates.
import pandas as pd
import requests
from bs4 import BeautifulSoup

urlsjugement = [
    "https://www.societe.com/societe/1804-transport-790808406.html",
    "https://www.societe.com/societe/235th-barber-street-enghien-833867153.html",
    "https://www.societe.com/societe/2a-protect-894269117.html",
    "https://www.societe.com/societe/2fnc-410002000.html",
]
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
}
data = []
for url in urlsjugement:
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    title = soup.select_one("#identite_deno").get_text(strip=True)
    try:
        # Judgment type and judgment date, both in the same cell
        active = soup.select_one('td:-soup-contains("Jugement") td').get_text(
            strip=True)
    except AttributeError:
        # select_one() returned None: no judgment cell on this page
        print("No judgment type found for", title)
        active = "En activité"  # i.e. still active
    active = active[0:48]
    date = soup.select_one('td:-soup-contains("Date création entreprise") td').get_text(
        strip=True)
    date = date[0:10]
    data.append([title, active, date])
df = pd.DataFrame(
    data,
    columns=["Title", "Active", "Date"],
)
print(df.to_markdown())
First, I would like to split the judgment and the judgment date into two separate fields so that I can compare the two dates. Each business has a creation date and a closing date, so I would like to compute the lifespan of each business. Is that possible?
|    | Title                       | Active                                | Date       |
|---:|:----------------------------|:--------------------------------------|:-----------|
|  0 | 1804 TRANSPORT              | Liquidation judiciaire le 07-01-2022- | 28-01-2013 |
|  1 | 235TH BARBER STREET ENGHIEN | Liquidation judiciaire le 28-01-2022- | 01-10-2017 |
|  2 | 2A PROTECT                  | Liquidation judiciaire le 17-01-2022- | 12-02-2021 |
|  3 | 2FNC                        | Liquidation judiciaire le 27-01-2022- | 01-12-1996 |
The Active column holds two pieces of information, and I want to separate them. After that, I want to calculate the time between the two dates. Thanks for your help!
CodePudding user response:
I only tried it with your first URL, but inside your for loop I would make this change:
    # needs "from datetime import datetime" at the top of the script
    title = soup.select_one("#identite_deno").text
    start = list(soup.select_one('td:-soup-contains("Date création entreprise") td'))[0].text.strip()
    end = list(soup.select_one('td.red').stripped_strings)[0].split('le ')[1]
    days = datetime.strptime(end, '%d-%m-%Y') - datetime.strptime(start, '%d-%m-%Y')
    data.append([title, start, end, days.days])
    # the DataFrame then needs four columns, e.g. columns=["Title", "Start", "End", "Days"]
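If you would rather keep your own selectors and split the text afterwards, here is a minimal sketch that works on the strings you already have. It assumes the Active text always looks like "<judgment> le dd-mm-yyyy" (which I have only checked against the four rows in your table) and returns None for anything else, such as your "En activité" fallback. The names split_judgment, lifespan_days, JUDGMENT_RE and DATE_FORMAT are made up for this example.

```python
import re
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"
# Judgment text, then " le ", then a dd-mm-yyyy date (assumed layout).
JUDGMENT_RE = re.compile(r"^(?P<kind>.*?)\s+le\s+(?P<date>\d{2}-\d{2}-\d{4})")


def split_judgment(active):
    """Return (judgment type, judgment date or None) from the Active text."""
    match = JUDGMENT_RE.match(active)
    if match is None:
        # No date found, e.g. the "En activité" fallback value.
        return active, None
    return match["kind"], datetime.strptime(match["date"], DATE_FORMAT)


def lifespan_days(created, active):
    """Days between the creation date and the judgment date, or None."""
    _, closed = split_judgment(active)
    if closed is None:
        return None
    return (closed - datetime.strptime(created, DATE_FORMAT)).days


# Rows copied from the table in the question: (Title, Active, Date).
rows = [
    ("1804 TRANSPORT", "Liquidation judiciaire le 07-01-2022-", "28-01-2013"),
    ("235TH BARBER STREET ENGHIEN", "Liquidation judiciaire le 28-01-2022-", "01-10-2017"),
    ("2A PROTECT", "Liquidation judiciaire le 17-01-2022-", "12-02-2021"),
    ("2FNC", "Liquidation judiciaire le 27-01-2022-", "01-12-1996"),
]

for title, active, created in rows:
    kind, closed = split_judgment(active)
    closed_date = closed.date() if closed else None
    print(title, kind, closed_date, lifespan_days(created, active), sep=" | ")
```

In your loop you could then call split_judgment(active) and lifespan_days(date, active) and append the results as extra columns, which also keeps businesses that are still active in the table instead of failing on them.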