Webscraping with Python and BS4: Can't find part of HTML page

I am trying to webscrape the text from pages 1-6 of this website:

Ideally, though, I am trying to extract each 'field' individually (the page is not structured as a table) so that I can eventually build a dataframe and export it to an Excel file. I can't seem to locate the individual parts of the HTML.

I can find and save the entire content div (although I can only seem to save it inside a loop, so the 6 pages appear hundreds of times), but I can't find any of the HTML elements nested beneath it. My code is as follows, trying to find the first 'field':

import requests
from bs4 import BeautifulSoup
import re

search = 'Leprosary of Saint Lazarus'

for pages in range(6):

    url = 'http://crusades-regesta.com/database'
    url += '?search_api_views_fulltext='
    url += '&field_institution_recipient=' + search
    url += '&field_grantor='
    url += '&field_recepient='
    url += '&field_year_1='
    url += '&field_year='
    url += '&field_term_type_field_term_title='
    url += '&page=' + str(pages)
    
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    content = soup.find('div', {'class': 'view-content'})

    for infos in content:
        try:
            rrr = infos.find('div', {'class': 'type type_18'}).text
        except:
            print("None found")

import pandas as pd
df = pd.DataFrame(columns=['rrr'])

with open('losl_test1.txt', 'w') as f:
    dfAsString = df.to_string(header=False, index=False)
    f.write(dfAsString)
    f.close()

This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know if I am not finding the right html part or what. Any help would be much appreciated.

CodePudding user response：

I think you wanted to paginate with a for loop and range() and grab the RRR value. I've handled the pagination by putting the page number into the full URL.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://crusades-regesta.com/database?search_api_views_fulltext=&field_institution_recipient=&field_grantor=&field_recepient=&field_year_1=&field_year=&field_term_type_field_term_title=&page={page}&f[0]=field_institution_recipient:Leprosary of Saint Lazarus'
data=[]
for page in range(1,7):
    req=requests.get(url.format(page=page))

    soup = BeautifulSoup(req.content,'lxml')
    
    for r in soup.select('[] span:nth-child(2)'):

        rr=list(r.stripped_strings)[-1]
        #print(rr)
        
        data.append(rr)


df = pd.DataFrame(data,columns=['RRR'])
print(df)
#df.to_csv('data.csv',index=False)

CodePudding user response：

What happens?

The main issue is that content = soup.find('div', {'class': 'view-content'}) is a single Tag, not a ResultSet. Iterating over it therefore walks the direct children of that one element, which include plain text nodes as well as the nested div elements, rather than a list of matches.

As a result, on the text nodes you are no longer calling BeautifulSoup's find() method but Python's string method find(), and the two work differently. Without the try/except you can see what is going on; on a text node it searches for a substring:

for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))

Output

...
-1
<div > <span >RRR: </span> <span ><div >Eleemosynary grant</div>2256</span> </div>
...
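
The behaviour can be reproduced without touching the website. The following is a minimal sketch on hypothetical markup (the HTML string and its class names are assumptions modelled on the snippets in this thread, not the real page). It shows that iterating a Tag yields text nodes as well as elements, and that filtering for Tag children avoids the string find():

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, loosely modelled on the structure discussed above.
html = """
<div class="view-content">
  <div class="views-row"><div class="type type_18">Grant</div>297</div>
  <div class="views-row"><div class="type type_7">Privilege</div>45</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", {"class": "view-content"})

# Iterating a Tag yields its direct children, whitespace text nodes included.
child_kinds = [type(child).__name__ for child in content]

# Keep only real elements before calling BeautifulSoup methods on them.
rows = [child for child in content if isinstance(child, Tag)]
labels = [row.find("div").get_text() for row in rows]

# A text node is a str subclass, so .find() on it is str.find().
first_text_node = next(c for c in content if isinstance(c, NavigableString))
str_find_result = first_text_node.find("div")

print(child_kinds)
print(labels)
print(str_find_result)
```

The -1 printed at the end is the str.find() "not found" result, the same value that appears in the output above.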

How to fix?

Select your elements more specifically, in this case the views-row elements:

sections = soup.find_all('div', {'class': 'views-row'})

While iterating over each section, you can select the expected value:

sections = soup.find_all('div', {'class': 'views-row'})

for section in sections:
    print(section.select_one('div[class*="type_"]').text)
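
Note that select_one() returns None when a section has no matching element, in which case .text raises an AttributeError. A minimal sketch of a guarded version, run against hypothetical markup (the HTML string is an assumption, with the second row deliberately lacking a type div):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second row has no "type_" div on purpose.
html = """
<div class="view-content">
  <div class="views-row views-row-1"><div class="type type_18">Eleemosynary grant</div>2256</div>
  <div class="views-row views-row-2">no type here</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

values = []
for section in soup.find_all("div", {"class": "views-row"}):
    # Matches any descendant div whose class attribute contains "type_".
    node = section.select_one('div[class*="type_"]')
    # Fall back to None instead of failing when nothing matched.
    values.append(node.get_text(strip=True) if node else None)

print(values)
```

Checking for None explicitly keeps the failure visible per row, which a bare try/except would hide.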

Example

This scrapes all the information and creates a DataFrame:

import requests
from bs4 import BeautifulSoup
import pandas as pd

search = 'Leprosary of Saint Lazarus'

data = []

for pages in range(6):

    url = 'http://crusades-regesta.com/database'
    url += '?search_api_views_fulltext='
    url += '&field_institution_recipient=' + search
    url += '&field_grantor='
    url += '&field_recepient='
    url += '&field_year_1='
    url += '&field_year='
    url += '&field_term_type_field_term_title='
    url += '&page=' + str(pages)
    
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    sections = soup.find_all('div', {'class': 'views-row'})

    for section in sections:
        d = {}
        for row in section.select('div.views-field'):
            d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|',strip=True)
        data.append(d)

df = pd.DataFrame(data)

# lower-case the headers and strip the trailing ': '
df.columns = df.columns.str.lower().str.replace(': ','')

# split the combined 'type|number' values of the rrr column into type and rrr
df[['type','rrr']] = df['rrr'].str.split("|",expand=True)

df

Output

rrr	type	year	initiator	recipient	institution	text	sources	comments
45	Privilege/exemption	1100	Godfrey of Bouillon	Order of Saint Lazarus	Leprosary of Saint Lazarus	*†Jul. 15-18. Jerusalem. Godfrey of Bouillon grants privileges to the order of St Lazarus at Jerusalem.	*†Jul. 15-18. Jerusalem. Godfrey of Bouillon grants privileges to the order of St Lazarus at Jerusalem.	Mayer, UKJ 3:1467-9, no. App. II/7
82	Council/ruling decisions/legislation	1104	Baldwin I	Order of Saint Lazarus	Leprosary of Saint Lazarus	*†May 26 – Dec. 24. King Baldwin I entrusts the military order of St Lazarus with custody of the city of Acre.	*†May 26 – Dec. 24. King Baldwin I entrusts the military order of St Lazarus with custody of the city of Acre.	Mayer, UKJ 3:1470-1, no. App.II/10
297	Grant	1131	Baldwin II	Leprosary of Saint Lazarus in Jerusalem	Leprosary of Saint Lazarus	*Apr. 14 1118– Aug. 21 1131. King Baldwin II issues a charter in favour of the leprosary of St Lazarus at Jerusalem.	*Apr. 14 1118– Aug. 21 1131. King Baldwin II issues a charter in favour of the leprosary of St Lazarus at Jerusalem.	Mayer, UKJ 1:282-4, no. 120
415	Eleemosynary grant	1142	Baldwinus Cesarensis	Leprosary of Saint Lazarus in Jerusalem	Leprosary of Saint Lazarus	*Mid 1134 – Dec. 24 1142. On his deathbed Baldwinus Cesarensis makes an eleemosynary grant, giving the leprosary of St Lazarus of a piece of land [between the Mt of Olives and the Red Cistern on the road to the River Jordan].	*Mid 1134 – Dec. 24 1142. On his deathbed Baldwinus Cesarensis makes an eleemosynary grant, giving the leprosary of St Lazarus of a piece of land [between the Mt of Olives and the Red Cistern on the road to the River Jordan].	Mayer, UKJ 1:332-3, 348, nos. 145, 165

...
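
The per-row extraction used in the example can also be tried offline. Below is a minimal sketch on hypothetical markup (the HTML string is an assumption; the class names follow the answer, and the values are borrowed from the output above). It builds one dict per views-row and splits the combined RRR value without pandas:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: label in the first span, value in the second span.
html = """
<div class="views-row">
  <div class="views-field"><span>RRR: </span><span><div class="type type_18">Grant</div>297</span></div>
  <div class="views-field"><span>Year: </span><span>1131</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for section in soup.find_all("div", {"class": "views-row"}):
    record = {}
    for row in section.select("div.views-field"):
        # "RRR: " -> "rrr"
        key = row.span.get_text(strip=True).rstrip(":").lower()
        # Join nested text pieces with "|" so they can be split later.
        value = row.select_one("span:nth-of-type(2)").get_text("|", strip=True)
        record[key] = value
    # The RRR field carries both the type and the number, e.g. "Grant|297".
    if "|" in record.get("rrr", ""):
        record["type"], record["rrr"] = record["rrr"].split("|", 1)
    records.append(record)

print(records)
```

A list of dicts like this can be passed straight to pd.DataFrame(records), as in the example above.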