Webscraping with Python and BS4: Can't find part of html page

Time:04-08

I am trying to webscrape the text from pages 1-6 of this website:

Ideally, I want to extract each 'field' individually (the page is not built as a table) so that I can create a DataFrame and export it to an Excel file. I can't seem to locate the individual parts of the HTML.

I can find and save the entire content div (although only inside a loop, so the 6 pages appear hundreds of times), but I can't find any HTML elements nested beneath it. My code, which tries to find the first 'field', is as follows:

import requests
from bs4 import BeautifulSoup
import re

search = 'Leprosary of Saint Lazarus'

for pages in range(6):

    url = 'http://crusades-regesta.com/database'
    url += '?search_api_views_fulltext='
    url += '&field_institution_recipient=' + search
    url += '&field_grantor='
    url += '&field_recepient='
    url += '&field_year_1='
    url += '&field_year='
    url += '&field_term_type_field_term_title='
    url += '&page=' + str(pages)
    
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    content = soup.find('div', {'class': 'view-content'})

    for infos in content:
        try:
            rrr = infos.find('div', {'class': 'type type_18'}).text
        except:
            print("None found")

import pandas as pd
df = pd.DataFrame(columns=['rrr'])

with open('losl_test1.txt', 'w') as f:
    dfAsString = df.to_string(header=False, index=False)
    f.write(dfAsString)
    f.close()

This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know if I am not finding the right html part or what. Any help would be much appreciated.

CodePudding user response:

I think you wanted to paginate with a for loop and range() and grab the RRR value. I've handled the pagination by putting a {page} placeholder in the long URL.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://crusades-regesta.com/database?search_api_views_fulltext=&field_institution_recipient=&field_grantor=&field_recepient=&field_year_1=&field_year=&field_term_type_field_term_title=&page={page}&f[0]=field_institution_recipient:Leprosary of Saint Lazarus'
data=[]
for page in range(1,7):
    req=requests.get(url.format(page=page))

    soup = BeautifulSoup(req.content,'lxml')
    
    for r in soup.select('[] span:nth-child(2)'):

        rr=list(r.stripped_strings)[-1]
        #print(rr)
        
        data.append(rr)


df = pd.DataFrame(data,columns=['RRR'])
print(df)
#df.to_csv('data.csv',index=False)

Output:

RRR
0    489
1    494
2    504
3    503
4    513
5    515
6    545
7    551
8    563
9    576
10   570
11   626
12   635
13   647
14   661
15   735
16   724
17   721
18   748
19   777
20   833
21   848
22   863
23   865
24   895
25   903
26   956
27  1128
28  1125
29  1124
30  1156
31  1165
32  1198
33  1352
34  1734
35  2035
36  2094
37  2095
38  2218
39  2232
40  2256
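The page-by-page URL can also be assembled from a dictionary instead of a long string. The following is only a sketch: the parameter names are copied from the question, build_url is a hypothetical helper, and whether the site counts pages from 0 or from 1 is not confirmed here (range(6) follows the question, while the answer above uses range(1,7)).

```python
from urllib.parse import urlencode

BASE_URL = "http://crusades-regesta.com/database"


def build_url(search, page):
    """Return the search URL for one results page (hypothetical helper)."""
    params = {
        "search_api_views_fulltext": "",
        "field_institution_recipient": search,
        "field_grantor": "",
        "field_recepient": "",
        "field_year_1": "",
        "field_year": "",
        "field_term_type_field_term_title": "",
        "page": page,
    }
    # urlencode percent-encodes values and turns spaces into '+'
    return BASE_URL + "?" + urlencode(params)


# Assumption: six result pages, numbered as in the question's range(6)
urls = [build_url("Leprosary of Saint Lazarus", page) for page in range(6)]
```

Each entry in urls can then be passed to requests.get(). Encoding the search term this way avoids sending raw spaces in the query string.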

CodePudding user response:

What happens?

The main issue is that content = soup.find('div', {'class': 'view-content'}) returns a single Tag, not a ResultSet. Iterating over that Tag walks its direct children, which are a mix of whitespace strings and child tags, rather than a list of matching elements.

Because some of those children are plain strings, calling find() on them invokes Python's string method find() instead of BeautifulSoup's find(), and the two behave differently. Without the try/except you can see what is going on; on the string children it tries to find a substring:

for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))

Output

...
-1
<div > <span >RRR: </span> <span ><div >Eleemosynary grant</div>2256</span> </div>
...
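The same behavior can be reproduced offline. This is a minimal sketch: the HTML string is a made-up stand-in for the real page (which has more nesting), and the kinds list exists only to show what the loop actually visits.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up stand-in for the real page, for illustration only
html = """
<div class="view-content">
  <div class="views-row">first row</div>
  <div class="views-row">second row</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", {"class": "view-content"})

# Iterating a Tag yields its direct children: whitespace strings and tags
kinds = []
for child in content:
    if isinstance(child, Tag):
        kinds.append("tag")
    elif isinstance(child, NavigableString):
        kinds.append("string")

# Keep only the real elements before calling BeautifulSoup methods on them
rows = [child for child in content if isinstance(child, Tag)]
```

Filtering with isinstance(child, Tag) is one way to skip the whitespace strings; selecting the rows directly with find_all() is usually simpler.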

How to fix?

Select your elements more specifically, in this case the views-row divs:

sections = soup.find_all('div', {'class': 'views-row'}) 

While iterating over the sections, you can select the expected value from each one:

sections = soup.find_all('div', {'class': 'views-row'})

for section in sections:
    print(section.select_one('div[class*="type_"]').text)
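A self-contained sketch of that selection, run against inline markup instead of the live site. The HTML is an assumption modeled on the output snippet above, and the class type_7 is hypothetical; only type_18 appears in the question.

```python
from bs4 import BeautifulSoup

# Assumed markup, modeled on the snippet shown earlier; type_7 is hypothetical
html = """
<div class="view-content">
  <div class="views-row">
    <div class="views-field"><span>RRR: </span><span><div class="type type_18">Eleemosynary grant</div>2256</span></div>
  </div>
  <div class="views-row">
    <div class="views-field"><span>RRR: </span><span><div class="type type_7">Grant</div>297</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

types = []
for section in soup.find_all("div", {"class": "views-row"}):
    # Match any div whose class attribute contains "type_"
    type_div = section.select_one('div[class*="type_"]')
    if type_div is not None:  # select_one returns None when nothing matches
        types.append(type_div.get_text(strip=True))
```

Checking for None before reading the text avoids the AttributeError that the bare try/except in the question was hiding.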

Example

This scrapes all the information and creates a DataFrame:

import requests
from bs4 import BeautifulSoup
import pandas as pd

search = 'Leprosary of Saint Lazarus'

data = []

for pages in range(6):

    url = 'http://crusades-regesta.com/database'
    url += '?search_api_views_fulltext='
    url += '&field_institution_recipient=' + search
    url += '&field_grantor='
    url += '&field_recepient='
    url += '&field_year_1='
    url += '&field_year='
    url += '&field_term_type_field_term_title='
    url += '&page=' + str(pages)
    
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    sections = soup.find_all('div', {'class': 'views-row'})

    for section in sections:
        d = {}
        for row in section.select('div.views-field'):
            d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|',strip=True)
        data.append(d)

df = pd.DataFrame(data)

# remove the ': ' suffix from the headers and lower-case them
df.columns = df.columns.str.lower().str.replace(': ','')

# split the combined values in the rrr column into type and rrr
df[['type','rrr']] = df['rrr'].str.split("|",expand=True)

df

Output

rrr type year initiator recipient institution text sources comments
45 Privilege/exemption 1100 Godfrey of Bouillon Order of Saint Lazarus Leprosary of Saint Lazarus *†Jul. 15-18. Jerusalem. Godfrey of Bouillon grants privileges to the order of St Lazarus at Jerusalem. *†Jul. 15-18. Jerusalem. Godfrey of Bouillon grants privileges to the order of St Lazarus at Jerusalem. Mayer, UKJ 3:1467-9, no. App. II/7
82 Council/ruling decisions/legislation 1104 Baldwin I Order of Saint Lazarus Leprosary of Saint Lazarus *†May 26 – Dec. 24. King Baldwin I entrusts the military order of St Lazarus with custody of the city of Acre. *†May 26 – Dec. 24. King Baldwin I entrusts the military order of St Lazarus with custody of the city of Acre. Mayer, UKJ 3:1470-1, no. App.II/10
297 Grant 1131 Baldwin II Leprosary of Saint Lazarus in Jerusalem Leprosary of Saint Lazarus *Apr. 14 1118– Aug. 21 1131. King Baldwin II issues a charter in favour of the leprosary of St Lazarus at Jerusalem. *Apr. 14 1118– Aug. 21 1131. King Baldwin II issues a charter in favour of the leprosary of St Lazarus at Jerusalem. Mayer, UKJ 1:282-4, no. 120
415 Eleemosynary grant 1142 Baldwinus Cesarensis Leprosary of Saint Lazarus in Jerusalem Leprosary of Saint Lazarus *Mid 1134 – Dec. 24 1142. On his deathbed Baldwinus Cesarensis makes an eleemosynary grant, giving the leprosary of St Lazarus of a piece of land [between the Mt of Olives and the Red Cistern on the road to the River Jordan]. *Mid 1134 – Dec. 24 1142. On his deathbed Baldwinus Cesarensis makes an eleemosynary grant, giving the leprosary of St Lazarus of a piece of land [between the Mt of Olives and the Red Cistern on the road to the River Jordan]. Mayer, UKJ 1:332-3, 348, nos. 145, 165

...
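The two clean-up steps at the end can be tried in isolation. This sketch uses two rows taken from the output above; the key 'Year: ' is an assumed label, since only 'RRR: ' is visible in the markup shown earlier.

```python
import pandas as pd

# Two rows as the scraping loop would collect them; 'Year: ' is an assumed label
data = [
    {"RRR: ": "Privilege/exemption|45", "Year: ": "1100"},
    {"RRR: ": "Eleemosynary grant|415", "Year: ": "1142"},
]

df = pd.DataFrame(data)

# 'RRR: ' -> 'rrr', 'Year: ' -> 'year'
df.columns = df.columns.str.lower().str.replace(": ", "", regex=False)

# 'Privilege/exemption|45' -> type='Privilege/exemption', rrr='45'
df[["type", "rrr"]] = df["rrr"].str.split("|", expand=True)
```

From here, df.to_excel('output.xlsx', index=False) would write the Excel file the question asks for, assuming an Excel writer engine such as openpyxl is installed; the file name is only a placeholder.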
