Scrape data and convert it to dates with BeautifulSoup

Time:05-13

I am scraping data from several web pages, and the data contains several dates. The code below retrieves the information I want; I have included only the part that deals with the dates.

import pandas as pd
import requests
from bs4 import BeautifulSoup

urlsjugement = [
    "https://www.societe.com/societe/1804-transport-790808406.html",
    "https://www.societe.com/societe/235th-barber-street-enghien-833867153.html",
    "https://www.societe.com/societe/2a-protect-894269117.html",
    "https://www.societe.com/societe/2fnc-410002000.html",
]
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
}
data = []
for url in urlsjugement:
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    title = soup.select_one("#identite_deno").get_text(strip=True)
    
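    try:
        # The "+" combinator selects the <td> right after the "Jugement" label cell.
        active = soup.select_one('td:-soup-contains("Jugement") + td').get_text(
            strip=True)
    except AttributeError:
        # select_one() returned None: this page has no judgment row.
        print("Je n'ai pas trouvé de type de jugement pour " + title)
        active = "En activité"
    active = active[0:48]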

    # Same pattern: take the cell next to the "Date création entreprise" label.
    date = soup.select_one('td:-soup-contains("Date création entreprise") + td').get_text(
        strip=True)
    date = date[0:10]

    data.append([title, active, date])

df = pd.DataFrame(
    data,
    columns=["Title", "Active", "Date"],
)

print(df.to_markdown())

First, I would like to split the judgment and the judgment date into two separate fields so that I can compare the two dates. Each business has a creation date and a closing date, and I would like to compute how long each business existed. Is that possible?


|    | Title                       | Active                                | Date       |
|---:|:----------------------------|:--------------------------------------|:-----------|
|  0 | 1804 TRANSPORT              | Liquidation judiciaire le 07-01-2022- | 28-01-2013 |
|  1 | 235TH BARBER STREET ENGHIEN | Liquidation judiciaire le 28-01-2022- | 01-10-2017 |
|  2 | 2A PROTECT                  | Liquidation judiciaire le 17-01-2022- | 12-02-2021 |
|  3 | 2FNC                        | Liquidation judiciaire le 27-01-2022- | 01-12-1996 |

The Active column holds two pieces of information (the judgment type and its date), and I want to separate them. After that, I want to calculate the time between the two dates. Thanks for your help!

Answer:

I only tried it with your first URL, but inside your for loop, I would make this change:

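# Needs "from datetime import datetime" at the top of the script.
title = soup.select_one("#identite_deno").text
# "+" selects the cell immediately after the label cell.
start = list(soup.select_one('td:-soup-contains("Date création entreprise") + td'))[0].text.strip()
# The judgment text reads "... le DD-MM-YYYY"; keep the part after "le ".
end = list(soup.select_one('td.red').stripped_strings)[0].split('le ')[1]
days = datetime.strptime(end, '%d-%m-%Y')-datetime.strptime(start, '%d-%m-%Y')
# Each row now has four values, so the DataFrame needs four column names.
data.append([title, start, end,days.days])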
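If you would rather split the strings you have already collected, here is a minimal sketch that works on the "Active" and "Date" values from the table in the question, with no network access. It assumes the judgment text always looks like "<type> le DD-MM-YYYY", which is only what the four sample rows show. The names `JUDGMENT_RE`, `split_judgment` and `lifespan_days` are made up for this example.

```python
import re
from datetime import datetime

# Assumed format, taken from the sample rows: "<type> le DD-MM-YYYY..."
JUDGMENT_RE = re.compile(r"^(?P<kind>.+?)\s+le\s+(?P<date>\d{2}-\d{2}-\d{4})")


def split_judgment(text):
    """Return (judgment type, judgment date), or (text, None) if no date is found."""
    match = JUDGMENT_RE.search(text)
    if match is None:
        # For example the "En activité" fallback used in the question.
        return text.strip(), None
    closed = datetime.strptime(match.group("date"), "%d-%m-%Y")
    return match.group("kind").strip(), closed


def lifespan_days(created, closed):
    """Days between the creation date string (DD-MM-YYYY) and the closing date."""
    if closed is None:
        return None
    return (closed - datetime.strptime(created, "%d-%m-%Y")).days


# Two rows copied from the table in the question.
rows = [
    ("1804 TRANSPORT", "Liquidation judiciaire le 07-01-2022-", "28-01-2013"),
    ("2A PROTECT", "Liquidation judiciaire le 17-01-2022-", "12-02-2021"),
]
for title, active, created in rows:
    kind, closed = split_judgment(active)
    print(title, "|", kind, "|", closed.date(), "|", lifespan_days(created, closed))
```

Businesses with no judgment come back with `None` for both the date and the lifespan, so they are easy to filter out later.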