I am scraping a school district's websites. Every school's site is built the same — the URLs are identical except for the school name. The code below works for one school, but when I swap in another school's name it gives blank output. Can anyone tell me where I am going wrong? Here is the working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 2):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        # collect the staff-profile links from the directory table
        for u in ['https://fairfaxhs.fcps.edu' + a.get('href') for a in
                  soup.table.select('tr td a[href]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1 div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass
df = pd.DataFrame(data)
df.to_csv('fcps_school.csv', index=False)
print(df)
Here is the other URL I am trying to scrape:
https://aldrines.fcps.edu/staff-directory?keywords=&field_last_name_from=&field_last_name_to=&items_per_page=10&page=
https://aldrines.fcps.edu
CodePudding user response:
I've scraped 10 pages as an example, without changing anything in the existing code, and it works fine — I get the same output in the csv file.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + a.get('href') for a in
                  soup.table.select('tr td a[href]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1 div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass
df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output
Name Position contact_url
0 Bouchera Abutaa Instructional Assistant https://fairfaxhs.fcps.edu/staff/bouchera-abutaa
1 Margaret Aderton Substitute Teacher - Regular Term https://fairfaxhs.fcps.edu/staff/margaret-aderton
2 Aja Adu-Gyamfi School Counselor, HS https://fairfaxhs.fcps.edu/staff/aja-adu-gyamfi
3 Paul Agyeman Custodian II https://fairfaxhs.fcps.edu/staff/paul-agyeman
4 Jin Ahn Food Services Worker https://fairfaxhs.fcps.edu/staff/jin-ahn
.. ... ... ...
95 Tiffany Haddock School Counselor, HS https://fairfaxhs.fcps.edu/staff/tiffany-haddock
96 Heather Hakes Learning Disabilities Teacher, MS/HS https://fairfaxhs.fcps.edu/staff/heather-hakes
97 Gabrielle Hall History & Social Studies Teacher, HS https://fairfaxhs.fcps.edu/staff/gabrielle-hall
98 Sydney Hamrick English Teacher, HS https://fairfaxhs.fcps.edu/staff/sydney-hamrick
99 Anne-Marie Hanapole Biology Teacher, HS https://fairfaxhs.fcps.edu/staff/anne-marie-ha...
[100 rows x 3 columns]
Update: Actually, success in web scraping doesn't depend only on coding skill — 50% of it is understanding the website. The domain names
1. https://fairfaxhs.fcps.edu
2. https://aldrines.fcps.edu
aren't the same, and the h1 tag's class value differs slightly (fcps-color--dark11 vs fcps-color--dark7); otherwise both sites' structure is alike. Because the bare except: pass swallows the error raised when the selector matches nothing, the mismatch showed up as blank output instead of an exception.
Working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://aldrines.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://aldrines.fcps.edu' + a.get('href') for a in
                  soup.table.select('tr td a[href]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark7').get_text(strip=True),
                'Position': soup2.select_one('h1 div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass
df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output:
Name ... contact_url
0 Jamileh Abu-Ghannam ... https://aldrines.fcps.edu/staff/jamileh-abu-gh...
1 Linda Adgate ... https://aldrines.fcps.edu/staff/linda-adgate
2 Rehab Ahmed ... https://aldrines.fcps.edu/staff/rehab-ahmed
3 Richard Amernick ... https://aldrines.fcps.edu/staff/richard-amernick
4 Laura Arm ... https://aldrines.fcps.edu/staff/laura-arm
.. ... ... ...
95 Melissa Weinhaus ... https://aldrines.fcps.edu/staff/melissa-weinhaus
96 Kathryn Wheeler ... https://aldrines.fcps.edu/staff/kathryn-wheeler
97 Latoya Wilson ... https://aldrines.fcps.edu/staff/latoya-wilson
98 Shane Wolfe ... https://aldrines.fcps.edu/staff/shane-wolfe
99 Michael Woodring ... https://aldrines.fcps.edu/staff/michael-woodring
[100 rows x 3 columns]
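Since the only difference between the two sites is the color suffix on the h1 class, one option is to match on the stable part of the class only (h1.node__title) so the same parser works for any school in the district. Here is a minimal, self-contained sketch of that idea — the two HTML snippets are made-up stand-ins for the real profile pages, kept just long enough to show the class difference:

```python
from bs4 import BeautifulSoup

# Illustrative snippets mimicking the two schools' staff-profile markup;
# only the color class on the <h1> differs (dark11 vs dark7).
FAIRFAX = '<h1 class="node__title fcps-color--dark11">Jane Doe<div>Math Teacher, HS</div></h1>'
ALDRIN = '<h1 class="node__title fcps-color--dark7">John Roe<div>Custodian II</div></h1>'

def parse_profile(html):
    """Extract (name, position) from a staff-profile page."""
    soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; swap in 'lxml' if installed
    h1 = soup.select_one('h1.node__title')     # drop the color class -- it varies per school
    position = h1.select_one('div').get_text(strip=True)
    name = h1.contents[0].strip()              # the bare text node before the <div>
    return name, position

print(parse_profile(FAIRFAX))  # ('Jane Doe', 'Math Teacher, HS')
print(parse_profile(ALDRIN))   # ('John Roe', 'Custodian II')
```

With a selector like this, only the subdomain needs to change per school, so the school name can become a single parameter instead of being edited in three places.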