Code gives blank output and works on only one website, although both websites are built the same way

Time:10-03

I am scraping school websites in one district. All of the sites are built the same way, and every URL is identical except for the school name. The code I use works for only one school; when I put in another school's name, it gives blank output. Can anyone help me see where I am going wrong? Here is the working code:

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0,2):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text,'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[]')]:
      
            soup2 = BeautifulSoup(requests.get(u).text,'lxml')
            d={
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True), 
                'Position': soup2.select_one('h1 div').get_text(strip=True),
                'contact_url': u
                }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data)
# to_csv returns None when given a path, so keep the DataFrame in its own variable
df.to_csv('fcps_school.csv', index=False)
print(df)

Here is the other URL I am trying to scrape:

https://aldrines.fcps.edu/staff-directory?keywords=&field_last_name_from=&field_last_name_to=&items_per_page=10&page=
https://aldrines.fcps.edu

CodePudding user response:

As an example, I scraped 10 pages without changing anything else in the existing code. It works fine, and the CSV file gets the same output.

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0,10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text,'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[]')]:
      
            soup2 = BeautifulSoup(requests.get(u).text,'lxml')
            d={
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True), 
                'Position': soup2.select_one('h1 div').get_text(strip=True),
                'contact_url': u
                }
            data.append(d)
    except:
        pass

df=pd.DataFrame(data)#.to_csv('fcps_school.csv',index=False)
print(df)

Output

         Name                              Position                                        contact_url
0       Bouchera Abutaa               Instructional Assistant   https://fairfaxhs.fcps.edu/staff/bouchera-abutaa    
1      Margaret Aderton     Substitute Teacher - Regular Term  https://fairfaxhs.fcps.edu/staff/margaret-aderton    
2        Aja Adu-Gyamfi                  School Counselor, HS    https://fairfaxhs.fcps.edu/staff/aja-adu-gyamfi    
3          Paul Agyeman                          Custodian II      https://fairfaxhs.fcps.edu/staff/paul-agyeman    
4               Jin Ahn                  Food Services Worker           https://fairfaxhs.fcps.edu/staff/jin-ahn    
..                  ...                                   ...                                                ...    
95      Tiffany Haddock                  School Counselor, HS   https://fairfaxhs.fcps.edu/staff/tiffany-haddock    
96        Heather Hakes  Learning Disabilities Teacher, MS/HS     https://fairfaxhs.fcps.edu/staff/heather-hakes    
97       Gabrielle Hall  History & Social Studies Teacher, HS    https://fairfaxhs.fcps.edu/staff/gabrielle-hall    
98       Sydney Hamrick                   English Teacher, HS    https://fairfaxhs.fcps.edu/staff/sydney-hamrick    
99  Anne-Marie Hanapole                   Biology Teacher, HS  https://fairfaxhs.fcps.edu/staff/anne-marie-ha...    

[100 rows x 3 columns]

Update: Success in web scraping depends not only on coding skill; a good understanding of the website matters just as much.

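Domain names:

  1. https://fairfaxhs.fcps.edu
  2. https://aldrines.fcps.edu

The two sites are not identical. The class on the h1 tag differs slightly (fcps-color--dark11 on fairfaxhs, fcps-color--dark7 on aldrines), so the Name selector has to change along with the domain. Otherwise, the structure of both websites is alike.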
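The difference can be reproduced offline. The sketch below uses two hypothetical, simplified h1 snippets (not the real page markup) and a helper named extract_name, which is an assumption for illustration. It shows that the school-specific selector returns nothing on the other site, and that matching on the shared node__title class alone covers both. In the original code, that missing match makes .get_text fail with an AttributeError, which the bare except then hides, so data stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-ins for one staff page on each site.
FAIRFAX_HTML = '<h1 class="node__title fcps-color--dark11">Paul Agyeman</h1>'
ALDRIN_HTML = '<h1 class="node__title fcps-color--dark7">Linda Adgate</h1>'


def extract_name(html, selector="h1.node__title"):
    """Return the stripped heading text, or None when the selector matches nothing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one(selector)
    return heading.get_text(strip=True) if heading is not None else None


# The school-specific selector matches only the page it was written for.
old_selector = "h1.node__title.fcps-color--dark11"
print(extract_name(FAIRFAX_HTML, old_selector))
print(extract_name(ALDRIN_HTML, old_selector))

# The shared class alone matches both pages.
print(extract_name(FAIRFAX_HTML))
print(extract_name(ALDRIN_HTML))
```

Returning None instead of raising makes a missing heading visible in the resulting rows rather than silently dropping the whole page.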

Working code:

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://aldrines.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0,10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text,'lxml')
    try:
        for u in ['https://aldrines.fcps.edu' + link.a.get('href') for link in soup.table.select('tr td[]')]:
      
            soup2 = BeautifulSoup(requests.get(u).text,'lxml')
            d={
                'Name': soup2.select_one('h1.node__title.fcps-color--dark7').get_text(strip=True), 
                'Position': soup2.select_one('h1 div').get_text(strip=True),
                'contact_url': u
                }
            data.append(d)
    except:
        pass

df=pd.DataFrame(data)#.to_csv('fcps_school.csv',index=False)
print(df)

Output:

  Name  ...                                        contact_url
0   Jamileh Abu-Ghannam  ...  https://aldrines.fcps.edu/staff/jamileh-abu-gh...
1          Linda Adgate  ...       https://aldrines.fcps.edu/staff/linda-adgate
2           Rehab Ahmed  ...        https://aldrines.fcps.edu/staff/rehab-ahmed
3      Richard Amernick  ...   https://aldrines.fcps.edu/staff/richard-amernick
4             Laura Arm  ...          https://aldrines.fcps.edu/staff/laura-arm
..                  ...  ...                                                ...
95     Melissa Weinhaus  ...   https://aldrines.fcps.edu/staff/melissa-weinhaus
96      Kathryn Wheeler  ...    https://aldrines.fcps.edu/staff/kathryn-wheeler
97        Latoya Wilson  ...      https://aldrines.fcps.edu/staff/latoya-wilson
98          Shane Wolfe  ...        https://aldrines.fcps.edu/staff/shane-wolfe
99     Michael Woodring  ...   https://aldrines.fcps.edu/staff/michael-woodring

[100 rows x 3 columns]
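Since only the domain changes between schools, the base URL can be kept in one place so that switching schools is a one-line change. This is a minimal sketch using only the standard library; the names DIRECTORY_PATH, directory_url and staff_url are assumptions for illustration, and the query string is the one used in the code above.

```python
from urllib.parse import urljoin

# Query string copied from the scripts above; only the page number varies.
DIRECTORY_PATH = (
    "/staff-directory?field_last_name_from=&field_last_name_to="
    "&items_per_page=10&keywords=&page={page}"
)


def directory_url(base_url, page):
    """Build the staff-directory URL for one school and one page."""
    return base_url.rstrip("/") + DIRECTORY_PATH.format(page=page)


def staff_url(base_url, href):
    """Turn a relative href from the directory table into an absolute URL."""
    return urljoin(base_url, href)


base = "https://aldrines.fcps.edu"
print(directory_url(base, 0))
print(staff_url(base, "/staff/linda-adgate"))
```

With these helpers, the scraping loop would take the base URL as its only school-specific input, apart from the selector difference described above.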