I am trying to webscrape the text from pages 1-6 of this website:
Ideally, I want to extract each 'field' individually (they are not laid out as a table), build a dataframe from them, and export it to an Excel file. I can't seem to locate the individual parts of the HTML.
I can find and save the entire content div (although only inside a loop, so the 6 pages appear hundreds of times), but I can't find any of the HTML elements nested beneath it. My code is as follows, trying to find the first 'field':
import requests
from bs4 import BeautifulSoup
import re
search = 'Leprosary of Saint Lazarus'
for pages in range(6):
    url = 'http://crusades-regesta.com/database'
    url += '?search_api_views_fulltext='
    url += '&field_institution_recipient=' + search
    url += '&field_grantor='
    url += '&field_recepient='
    url += '&field_year_1='
    url += '&field_year='
    url += '&field_term_type_field_term_title='
    url += '&page=' + str(pages)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    content = soup.find('div', {'class': 'view-content'})
    for infos in content:
        try:
            rrr = infos.find('div', {'class': 'type type_18'}).text
        except:
            print("None found")
import pandas as pd
df = pd.DataFrame(columns=['rrr'])
with open('losl_test1.txt', 'w') as f:
    dfAsString = df.to_string(header=False, index=False)
    f.write(dfAsString)
    f.close()
This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know if I am not finding the right html part or what. Any help would be much appreciated.
CodePudding user response:
I think you want to paginate with a for loop and range() and grab the RRR value. I handle the pagination through the page parameter in the long URL.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://crusades-regesta.com/database?search_api_views_fulltext=&field_institution_recipient=&field_grantor=&field_recepient=&field_year_1=&field_year=&field_term_type_field_term_title=&page={page}&f[0]=field_institution_recipient:Leprosary of Saint Lazarus'
data = []
for page in range(1, 7):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'lxml')
    for r in soup.select('[] span:nth-child(2)'):
        rr = list(r.stripped_strings)[-1]
        #print(rr)
        data.append(rr)
df = pd.DataFrame(data, columns=['RRR'])
print(df)
#df.to_csv('data.csv',index=False)
Output:
RRR
0 489
1 494
2 504
3 503
4 513
5 515
6 545
7 551
8 563
9 576
10 570
11 626
12 635
13 647
14 661
15 735
16 724
17 721
18 748
19 777
20 833
21 848
22 863
23 865
24 895
25 903
26 956
27 1128
28 1125
29 1124
30 1156
31 1165
32 1198
33 1352
34 1734
35 2035
36 2094
37 2095
38 2218
39 2232
40 2256
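As a side note on the paginated URL, the query string can also be assembled from a dict instead of one long literal. This is only a sketch: the parameter names are copied from the question's query string, build_url is a hypothetical helper, and whether the site counts pages from 0 or 1 is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = 'http://crusades-regesta.com/database'


def build_url(page, institution='Leprosary of Saint Lazarus'):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        'search_api_views_fulltext': '',
        'field_institution_recipient': institution,
        'field_grantor': '',
        'field_recepient': '',
        'field_year_1': '',
        'field_year': '',
        'field_term_type_field_term_title': '',
        'page': page,
    }
    # urlencode escapes the spaces in the institution name as '+'
    return BASE_URL + '?' + urlencode(params)


# Same page range as the loop above
urls = [build_url(page) for page in range(1, 7)]
for u in urls:
    print(u)
```

With requests, passing the same dict as requests.get(BASE_URL, params=params) should have a similar effect without building the string by hand.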
CodePudding user response:
What happens?
The main issue is that content = soup.find('div', {'class': 'view-content'}) is not a ResultSet: find() returns a single element. Your second loop therefore iterates over that element's direct children, and some of those children are plain strings rather than tags.
On a string child you are no longer calling BeautifulSoup's find() method but Python's string method find(), and the two behave differently. Without the try/except you can see what is going on, because it tries to find a substring:
for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))
Output
...
-1
<div > <span >RRR: </span> <span ><div >Eleemosynary grant</div>2256</span> </div>
...
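The same behaviour can be reproduced offline. The sketch below uses a small made-up HTML snippet (an assumption standing in for the real page) and records what kind of child each loop iteration receives and what find() does on it.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up markup standing in for the real page
html = (
    '<div class="view-content">\n'
    '<div class="views-row"><div class="type type_18">Grant</div></div>\n'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('div', {'class': 'view-content'})  # one Tag, not a ResultSet

kinds = []
for child in content:  # iterates over the Tag's direct children
    if isinstance(child, NavigableString):
        # A NavigableString is also a str, so this is a substring search
        kinds.append(('string', child.find('div')))
    elif isinstance(child, Tag):
        # On a Tag, find() searches the descendants and returns a Tag or None
        found = child.find('div', {'class': 'type type_18'})
        kinds.append(('tag', found.text if found else None))

print(kinds)
```

The newlines between the tags show up as string children, which is where the substring search comes from.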
How to fix?
Select your elements more specifically, in this case the views-row elements:
sections = soup.find_all('div', {'class': 'views-row'})
While iterating over the sections you can select the expected value:
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    print(section.select_one('div[class*="type_"]').text)
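A minimal offline sketch of that selection, assuming markup shaped like the RRR row shown in the output above (the snippet itself is made up). It also guards against rows that have no type element, since select_one() returns None in that case.

```python
from bs4 import BeautifulSoup

# Made-up markup; the second row deliberately lacks a type element
html = '''
<div class="view-content">
  <div class="views-row">
    <div class="views-field"><span>RRR: </span><span><div class="type type_18">Eleemosynary grant</div>2256</span></div>
  </div>
  <div class="views-row">
    <div class="views-field"><span>RRR: </span><span>2257</span></div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

types = []
for section in soup.find_all('div', {'class': 'views-row'}):
    # [class*="type_"] matches any class attribute containing "type_"
    node = section.select_one('div[class*="type_"]')
    types.append(node.get_text(strip=True) if node else None)

print(types)
```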
Example
This scrapes all the information and creates a DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
search = 'Leprosary of Saint Lazarus'
data = []
for pages in range(6):
    url = 'http://crusades-regesta.com/database'
    url += '?search_api_views_fulltext='
    url += '&field_institution_recipient=' + search
    url += '&field_grantor='
    url += '&field_recepient='
    url += '&field_year_1='
    url += '&field_year='
    url += '&field_term_type_field_term_title='
    url += '&page=' + str(pages)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    sections = soup.find_all('div', {'class': 'views-row'})
    for section in sections:
        # one dict per result row: field label -> field value
        d = {}
        for row in section.select('div.views-field'):
            d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|', strip=True)
        data.append(d)
df = pd.DataFrame(data)
# strip ': ' from the headers and set them all to lower case
df.columns = df.columns.str.lower().str.replace(': ', '')
# split the values in the rrr column into type and rrr
df[['type', 'rrr']] = df['rrr'].str.split("|", expand=True)
df
Output
rrr | type | year | initiator | recipient | institution | text | sources | comments |
---|---|---|---|---|---|---|---|---|
45 | Privilege/exemption | 1100 | Godfrey of Bouillon | Order of Saint Lazarus | Leprosary of Saint Lazarus | *†Jul. 15-18. Jerusalem. Godfrey of Bouillon grants privileges to the order of St Lazarus at Jerusalem. | *†Jul. 15-18. Jerusalem. Godfrey of Bouillon grants privileges to the order of St Lazarus at Jerusalem. | Mayer, UKJ 3:1467-9, no. App. II/7 |
82 | Council/ruling decisions/legislation | 1104 | Baldwin I | Order of Saint Lazarus | Leprosary of Saint Lazarus | *†May 26 – Dec. 24. King Baldwin I entrusts the military order of St Lazarus with custody of the city of Acre. | *†May 26 – Dec. 24. King Baldwin I entrusts the military order of St Lazarus with custody of the city of Acre. | Mayer, UKJ 3:1470-1, no. App.II/10 |
297 | Grant | 1131 | Baldwin II | Leprosary of Saint Lazarus in Jerusalem | Leprosary of Saint Lazarus | *Apr. 14 1118– Aug. 21 1131. King Baldwin II issues a charter in favour of the leprosary of St Lazarus at Jerusalem. | *Apr. 14 1118– Aug. 21 1131. King Baldwin II issues a charter in favour of the leprosary of St Lazarus at Jerusalem. | Mayer, UKJ 1:282-4, no. 120 |
415 | Eleemosynary grant | 1142 | Baldwinus Cesarensis | Leprosary of Saint Lazarus in Jerusalem | Leprosary of Saint Lazarus | *Mid 1134 – Dec. 24 1142. On his deathbed Baldwinus Cesarensis makes an eleemosynary grant, giving the leprosary of St Lazarus of a piece of land [between the Mt of Olives and the Red Cistern on the road to the River Jordan]. | *Mid 1134 – Dec. 24 1142. On his deathbed Baldwinus Cesarensis makes an eleemosynary grant, giving the leprosary of St Lazarus of a piece of land [between the Mt of Olives and the Red Cistern on the road to the River Jordan]. | Mayer, UKJ 1:332-3, 348, nos. 145, 165 |
...
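The header clean-up and the split of the combined rrr value can be tried without any network access. In this sketch the two records are hand-written stand-ins for what the scraping loop collects: the values are taken from the output above, and the exact label strings ('RRR: ', 'Year: ') are an assumption.

```python
import pandas as pd

# Hand-written stand-ins for the scraped dicts (labels are assumed)
data = [
    {'RRR: ': 'Privilege/exemption|45', 'Year: ': '1100'},
    {'RRR: ': 'Eleemosynary grant|415', 'Year: ': '1142'},
]
df = pd.DataFrame(data)

# 'RRR: ' -> 'rrr', 'Year: ' -> 'year'
df.columns = df.columns.str.lower().str.replace(': ', '', regex=False)

# 'Privilege/exemption|45' -> type='Privilege/exemption', rrr='45'
df[['type', 'rrr']] = df['rrr'].str.split('|', expand=True)

print(df)
```

One caveat of this approach: a row whose value contains no '|' splits into a single part, so the number would end up in type and rrr would be empty.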