I am still quite a beginner. I have written myself a web-scraping script that works to my satisfaction so far; it writes the records it finds to a CSV file.
Now I would like to extend the script, but I can't find the right approach.
I would like to further filter the results my script produces before writing them to the CSV: only those records should be written where certain terms occur in either the field short_b or organization.
Something like this:
if short_b contains ("term1" or "term2" or "term3" ...)
or
if organization contains ("term1" or "term2" or "term3" ...)
How can I implement something like this?
Here is the code so far of the subroutine in which I want to include the filtering. Don't be surprised that some names are German (I am German):
while True:
    soup = BeautifulSoup(driver.page_source, "html.parser")
    results = soup.find('div', {'id': 'contentContainer'}).find('tbody').find_all('tr')
    for i in results:
        verdatum = i.find_all('td')[0].get_text().strip()
        # parse the date via datetime
        dateString = verdatum
        dateFormatter = "%d.%m.%Y"
        ver_datum_date = datetime.strptime(dateString, dateFormatter).date()
        # check date
        if ver_datum_date >= pruefdatum:
            # collect the remaining fields of the row
            frist = i.find_all('td')[1].get_text().strip()
            ausschreibung = i.find_all('td')[2].get_text().strip()
            typ = i.find_all('td')[3].get_text().strip()
            organisation = i.find_all('td')[4].get_text().strip()
            i_info = {
                'Vergabedatum': verdatum,
                'Frist': frist,
                'Organisation': organisation,
                'Ausschreibung': ausschreibung,
                'Typ': typ,
                'Website': website,
                'Prüfung ab': pruefdatum_format,
                'Datei erzeugt': jetzt
            }
            ausschreibungsliste.append(i_info)
    # print(ausschreibungsliste)
    time.sleep(2)
    # check pagination
    if not soup.find('a', {'class': 'browseLastGhost'}):
        driver.find_element_by_xpath('//a[@]').click()
    else:
        print('Ausschreibungen gefunden:', len(ausschreibungsliste))
        break
Thanks a lot.
With the help of HedgeHog, this is the final code:
while True:
    soup = BeautifulSoup(driver.page_source, "html.parser")
    results = soup.find('div', {'id': 'contentContainer'}).find('tbody').find_all('tr')
    for i in results:
        verdatum = i.find_all('td')[0].get_text().strip()
        # parse the date via datetime
        dateString = verdatum
        dateFormatter = "%d.%m.%Y"
        ver_datum_date = datetime.strptime(dateString, dateFormatter).date()
        # check date
        if ver_datum_date >= pruefdatum:
            # collect the remaining fields of the row
            frist = i.find_all('td')[1].get_text().strip()
            ausschreibung = i.find_all('td')[2].get_text().strip()
            typ = i.find_all('td')[3].get_text().strip()
            organisation = i.find_all('td')[4].get_text().strip()
            i_info = {
                'Vergabedatum': verdatum,
                'Frist': frist,
                'Organisation': organisation,
                'Ausschreibung': ausschreibung,
                'Typ': typ,
                'Website': website,
                'Prüfung ab': pruefdatum_format,
                'Datei erzeugt': jetzt
            }
            # append the record once if a search term occurs in either
            # the organisation or the ausschreibung field (a single
            # combined check avoids appending a matching row twice)
            if any(term in organisation for term in begriffeorganisation) \
                    or any(term in ausschreibung for term in begriffeausschreibung):
                ausschreibungsliste.append(i_info)
    # print(ausschreibungsliste)
    time.sleep(2)
    # check pagination
    if not soup.find('a', {'class': 'browseLastGhost'}):
        driver.find_element_by_xpath('//a[@]').click()
    else:
        print('Ausschreibungen gefunden:', len(ausschreibungsliste))
        break
CodePudding user response:
There is a wild mix in your naming, so I tried to focus on the names used in the code part; it is also missing the short_b from your example.
A general approach could be to check with any() whether one of the terms from a list occurs in your extracted strings:
term_list = ['term1', 'term2', 'term3']

if any(term in organisation for term in term_list):
    ausschreibungsliste.append(i_info)
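To make the any() idea concrete, here is a small self-contained sketch; the helper name `matches` and the sample strings are made up for illustration and are not part of the original code:

```python
term_list = ['Bau', 'Strasse', 'IT']

def matches(text, terms):
    """Return True if any of the search terms occurs as a substring of text."""
    # any() stops at the first matching term (case-sensitive check)
    return any(term in text for term in terms)

print(matches('Stadtverwaltung Bauamt', term_list))  # True  ('Bau' occurs)
print(matches('Finanzamt', term_list))               # False (no term occurs)
```

Note that the check is case-sensitive; lower-casing both sides before comparing would make it case-insensitive.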
or, as mentioned, you could filter your DataFrame instead:
import pandas as pd

df = pd.DataFrame(ausschreibungsliste)
term_list = ['term1', 'term2', 'term3']
df[df['Organisation'].str.contains('|'.join(term_list))]
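As a runnable sketch of the DataFrame variant (the two sample rows are invented for illustration, not taken from the real scrape): `'|'.join(term_list)` builds a regex alternation such as `'Bau|Strasse'`, and `str.contains` keeps every row whose Organisation matches it.

```python
import pandas as pd

# hypothetical rows standing in for the scraped ausschreibungsliste
ausschreibungsliste = [
    {'Organisation': 'Stadt Berlin Bauamt', 'Ausschreibung': 'Strassenbau'},
    {'Organisation': 'Finanzamt', 'Ausschreibung': 'Steuerberatung'},
]
df = pd.DataFrame(ausschreibungsliste)

term_list = ['Bau', 'Strasse']
# '|'.join(term_list) -> 'Bau|Strasse', used as a regex alternation
filtered = df[df['Organisation'].str.contains('|'.join(term_list))]
# only the first row survives the filter
```

`str.contains` also accepts `case=False` for a case-insensitive match, which can be handy here since organisation names are not always capitalised consistently.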