Webscraping with Python BeautifulSoup - further search for terms in filtered soup


I am still quite a beginner. I have written a web-scraping script that works to my satisfaction so far; it writes its records to a CSV file.

Now I would like to extend the script, but I can't find the right approach.

I would like to filter the results my script produces further before writing them to the CSV. Only those records should be written where certain terms occur in either the short_b or the organization field.

Something like this:

if short_b contains (term1 or term2 or term3 ...)

or

if organization contains (term1 or term2 or term3 ...)

How can I implement something like this?

Here is the code so far of the subroutine in which I want to include the filtering. Don't be surprised that some terms are German (I am German):

while True:

        soup = BeautifulSoup(driver.page_source, "html.parser")
        results = soup.find('div', {'id':'contentContainer'}).find('tbody').find_all('tr')

        for i in results:
                    
            verdatum = i.find_all('td')[0].get_text().strip()
           
            # create date via datetime 
            dateString = verdatum
            dateFormatter = "%d.%m.%Y"
            ver_datum_date = datetime.strptime(dateString, dateFormatter).date()

            # check date
            if ver_datum_date >= pruefdatum:
                            
                # create list 
                verdatum = i.find_all('td')[0].get_text().strip()         
                frist = i.find_all('td')[1].get_text().strip()
                ausschreibung = i.find_all('td')[2].get_text().strip()
                typ = i.find_all('td')[3].get_text().strip()
                organisation = i.find_all('td')[4].get_text().strip()

                i_info = {
                    'Vergabedatum': verdatum,
                    'Frist':  frist,
                    'Organisation': organisation,
                    'Ausschreibung': ausschreibung,
                    'Typ': typ,
                    'Website': website,
                    'Prüfung ab': pruefdatum_format,
                    'Datei erzeugt': jetzt

                }
                ausschreibungsliste.append(i_info)
        
        
        # print(ausschreibungsliste)
        time.sleep(2)

        # check pagination
        if not soup.find('a', {'class':'browseLastGhost'}):
            next=driver.find_element_by_xpath('//a[@]').click()
        else:
            print('Ausschreibungen gefunden :', len(ausschreibungsliste))
            break

Thanks a lot.

With the help of HedgeHog, this is the final code:

while True:

        soup = BeautifulSoup(driver.page_source, "html.parser")
        results = soup.find('div', {'id':'contentContainer'}).find('tbody').find_all('tr')

        for i in results:
                    
            verdatum = i.find_all('td')[0].get_text().strip()
           
            # create date via datetime 
            dateString = verdatum
            dateFormatter = "%d.%m.%Y"
            ver_datum_date = datetime.strptime(dateString, dateFormatter).date()

            # check date
            if ver_datum_date >= pruefdatum:
                            
                # create list 
                verdatum = i.find_all('td')[0].get_text().strip()         
                frist = i.find_all('td')[1].get_text().strip()
                ausschreibung = i.find_all('td')[2].get_text().strip()
                typ = i.find_all('td')[3].get_text().strip()
                organisation = i.find_all('td')[4].get_text().strip()

                i_info = {
                    'Vergabedatum': verdatum,
                    'Frist':  frist,
                    'Organisation': organisation,
                    'Ausschreibung': ausschreibung,
                    'Typ': typ,
                    'Website': website,
                    'Prüfung ab': pruefdatum_format,
                    'Datei erzeugt': jetzt

                }
                
                # append the record only once, if either field contains one of the terms
                if (any(term in organisation for term in begriffeorganisation)
                        or any(term in ausschreibung for term in begriffeausschreibung)):
                    ausschreibungsliste.append(i_info)
        
        
        # print(ausschreibungsliste)
        time.sleep(2)

        # check pagination
        if not soup.find('a', {'class':'browseLastGhost'}):
            next=driver.find_element_by_xpath('//a[@]').click()
        else:
            print('Ausschreibungen gefunden :', len(ausschreibungsliste))
            break

CodePudding user response:

There is a wild mix in your naming, so I focused on the names used in the code part; note that it is also missing the short_b field from your example.

A general approach could be to check with any() whether a term from a list occurs in your extracted strings:

term_list = ['term1','term2', 'term3']

if any(term in organisation for term in term_list):
    ausschreibungsliste.append(i_info)
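
If the terms may differ in upper/lower case from the page text, a case-insensitive variant of the same check is possible. This is only a sketch; term_list and the surrounding loop are assumed to be the same as above:

term_list = ['term1', 'term2', 'term3']

# sketch: lower-case both sides so 'Term1' on the page still matches 'term1'
if any(term.lower() in organisation.lower() for term in term_list):
    ausschreibungsliste.append(i_info)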

Or, as mentioned, you could filter your DataFrame instead:

import pandas as pd

df = pd.DataFrame(ausschreibungsliste)

term_list = ['term1', 'term2', 'term3']
df = df[df['Organisation'].str.contains('|'.join(term_list))]
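
To connect this back to the CSV export mentioned in the question, a minimal sketch could filter on both columns and write the result directly; the combined mask and the output filename are assumptions, not part of the original answer:

import pandas as pd

term_list = ['term1', 'term2', 'term3']
pattern = '|'.join(term_list)

df = pd.DataFrame(ausschreibungsliste)

# keep rows where either column contains one of the terms
mask = (df['Organisation'].str.contains(pattern)
        | df['Ausschreibung'].str.contains(pattern))

# hypothetical output filename
df[mask].to_csv('ausschreibungen.csv', index=False)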