I'm extracting all the abstract links from this page: https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009
Then I save them in a list using this code:
from bs4 import BeautifulSoup
import requests

abstract_list = []
r = requests.get('https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('a', class_='testo_normale_rosso'):
    if "show_abstract.php?varpdf=" in link.get('href'):
        # TO ADD
        baseurl = 'https://www.deswater.com/'
        links = baseurl + link.attrs['href'].replace('show_abstract.php?varpdf=', '')
        abstract_list.append(links)
print(abstract_list)
In the end I get the list, but there is one issue: if you check the page, the first entry has no abstract link.
Hence, I want the code to detect that an entry has no abstract link and append 'No Abstract Link' to the list in its place.
Current output:
['https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf', ...]
What I want:
['No Abstract Link', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf', ...]
Appreciate your help.
CodePudding user response:
The HTML markup in the page is malformed, so try the html5lib or lxml parser:
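import requests
from bs4 import BeautifulSoup

url = "https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009"
soup = BeautifulSoup(requests.get(url).content, "lxml")

out = []
# Each article sits in a <font> element that contains an <hr>
for p in soup.select("font:has(hr)"):
    # Look only at direct children for an <a> whose text contains "Abstract"
    link = p.find(
        lambda tag: tag.name == "a" and "Abstract" in tag.text, recursive=False
    )
    # Keep the part after "=" as the PDF path; use "" when there is no abstract link
    out.append(
        "https://www.deswater.com/" + link["href"].split("=")[-1] if link else ""
    )
print(out)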
Prints:
[
"",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_12.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_19.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_29.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_34.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_42.pdf",
"https://www.deswater.com/DWT_abstracts/vol_5/5_2009_48.pdf",
...and so on.
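Both the question and this answer turn the relative href into an absolute URL with string operations (replace(...) or split("=")[-1]). As a hedged alternative, the query string can be parsed with the standard library instead. This is only a sketch: the helper name abstract_url, the BASE_URL constant and the placeholder text are illustrative assumptions and are not part of the original answer.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://www.deswater.com/"  # assumed site root, as used in the answers
PLACEHOLDER = "No Abstract Link"        # illustrative placeholder text


def abstract_url(href):
    """Return the absolute PDF URL named by the varpdf parameter, or a placeholder."""
    if not href:
        return PLACEHOLDER
    # Parse the query string instead of slicing the href by hand
    params = parse_qs(urlsplit(href).query)
    pdf_paths = params.get("varpdf")
    if not pdf_paths:
        return PLACEHOLDER
    return urljoin(BASE_URL, pdf_paths[0])


print(abstract_url("show_abstract.php?varpdf=DWT_abstracts/vol_5/5_2009_1.pdf"))
print(abstract_url(None))
```

In the loop above, the helper could be called as abstract_url(link["href"] if link else None), so the placeholder is produced in one place.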
CodePudding user response:
This seems to be nearly the same issue as in https://stackoverflow.com/a/75201360/14460824, so you will have to adapt or generalize that approach.
# base_url is defined in the full example below
for e in soup.select('.testo_normale hr'):
    fulltext = 'No Fulltext'
    abstract = 'No Abstract'
    # Walk backwards from each <hr> until the author's <i> tag is reached
    for tag in e.find_previous_siblings():
        if tag.name != 'i':
            if tag.get('href') and tag.get('href').startswith('show_abstract'):
                abstract = base_url + tag.get('href')
            if tag.get('href') and tag.get('href').startswith('fulltext'):
                fulltext = base_url + tag.get('href')
        else:
            break
Example
from bs4 import BeautifulSoup
import requests

base_url = 'https://www.deswater.com/'
# No parser is named here, so bs4 picks the best one installed (lxml if available)
soup = BeautifulSoup(requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009').content)

data = []
for e in soup.select('.testo_normale hr'):
    fulltext = 'No Fulltext'
    abstract = 'No Abstract'
    # Walk backwards from each <hr> until the author's <i> tag is reached
    for tag in e.find_previous_siblings():
        if tag.name != 'i':
            if tag.get('href') and tag.get('href').startswith('show_abstract'):
                abstract = base_url + tag.get('href')
            if tag.get('href') and tag.get('href').startswith('fulltext'):
                fulltext = base_url + tag.get('href')
        else:
            break
    data.append({
        'author': e.find_previous('i').text,
        'fulltext': fulltext,
        'abstract': abstract
    })
data
Output
[{'author': 'Miriam Balaban',
'fulltext': 'No Fulltext',
'abstract': 'https://www.deswater.com/show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_vii.pdf'},
{'author': 'W. Richard Bowen',
'fulltext': 'https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzEucGRm&desc=k@1@kfontk@13@kfacek@7@kk@30@kGenevak@6@kk@13@kArialk@6@kk@13@kHelveticak@6@kk@13@ksank@35@kserifk@30@kk@13@ksizek@7@kk@30@k2k@30@kk@2@kk@1@kik@[email protected]@13@kRichardk@13@kBowenk@1@kk@4@kik@2@kk@1@kbrk@2@kWaterk@13@kengineeringk@13@kfork@13@kthek@13@kpromotionk@13@kofk@13@kpeacek@1@kbrk@2@k1k@15@k2009k@16@k1k@35@k6k@1@kbrk@4@kk@2@kk@1@kak@13@khrefk@7@kDWTk@12@kabstractsk@4@kvolk@12@k1k@4@k1k@12@k2009k@[email protected]@13@kclassk@7@kk@5@kk@30@ktestok@12@knormalek@12@krossok@5@kk@30@kk@13@ktargetk@7@kk@5@kk@30@kk@12@kblankk@5@kk@30@kk@2@kAbstractk@1@kk@4@kak@2@kk@1@kbrk@2@k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2NC9URFdUX0FfMTA1MTI4NjRfTy5wZGY=&type=1',
'abstract': 'https://www.deswater.com/show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_1.pdf'},
{'author': 'Steven J. Duranceau',
'fulltext': 'https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzcucGRm&desc=k@1@kfontk@13@kfacek@7@kk@30@kGenevak@6@kk@13@kArialk@6@kk@13@kHelveticak@6@kk@13@ksank@35@kserifk@30@kk@13@ksizek@7@kk@30@k2k@30@kk@2@kk@1@kik@2@kStevenk@[email protected]@13@kDuranceauk@1@kk@4@kik@2@kk@1@kbrk@2@kModelingk@13@kthek@13@kpermeatek@13@ktransientk@13@kresponsek@13@ktok@13@kperturbationsk@13@kfromk@13@ksteadyk@13@kstatek@13@kink@13@kak@13@knanofiltrationk@13@kprocessk@1@kbrk@2@k1k@15@k2009k@16@k7k@35@k16k@1@kbrk@4@kk@2@kk@1@kak@13@khrefk@7@kDWTk@12@kabstractsk@4@kvolk@12@k1k@4@k1k@12@k2009k@[email protected]@13@kclassk@7@kk@5@kk@30@ktestok@12@knormalek@12@krossok@5@kk@30@kk@13@ktargetk@7@kk@5@kk@30@kk@12@kblankk@5@kk@30@kk@2@kAbstractk@1@kk@4@kak@2@kk@1@kbrk@2@k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2NS9URFdUX0FfMTA1MTI4NjVfTy5wZGY=&type=1',
'abstract': 'https://www.deswater.com/show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_7.pdf'},...]
CodePudding user response:
Not sure I understand your question correctly (it seems too easy) - nonetheless, here is one way to do it:
from bs4 import BeautifulSoup as bs
import requests

abstract_list = []
r = requests.get('https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009')
# Parsing the HTML
soup = bs(r.text, 'html.parser')
for link in soup.find_all('a', class_='testo_normale_rosso'):
    if "show_abstract.php?varpdf=" in link.get('href'):
        # TO ADD
        baseurl = 'https://www.deswater.com/'
        links = baseurl + link.attrs['href'].replace('show_abstract.php?varpdf=', '')
    else:
        links = 'Nothing, nada, zilch'
    # Appended for every matching <a>, whether or not it is an abstract link
    abstract_list.append(links)
print(abstract_list)
Result in terminal:
['Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_12.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_19.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_29.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_34.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_42.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_48.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_54.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_59.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_68.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_74.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_80.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_91.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_99.pdf', 'Nothing, 
nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_106.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_111.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_119.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_124.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_132.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_137.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_146.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_153.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_159.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_167.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_172.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_178.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_183.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_192.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 
'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_198.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_207.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_213.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_223.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_235.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_252.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_257.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_267.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_275.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_283.pdf', 'Nothing, nada, zilch']
Of course, you can append what you want to that list, when there's no link match.
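The result above contains one entry per link, so most articles contribute several placeholders. If one entry per article is wanted, as in the question, the links can be grouped by the <hr> that closes each article. The following is a minimal offline sketch of that idea, not the method used in any of the answers: it swaps in the standard library's html.parser.HTMLParser so it runs without extra packages, and the SAMPLE markup, the class name AbstractCollector and the placeholder text are assumptions made for illustration; the real page's structure may differ.

```python
from html.parser import HTMLParser

BASE_URL = "https://www.deswater.com/"
PREFIX = "show_abstract.php?varpdf="

# Hypothetical, simplified markup: each article ends with an <hr>
SAMPLE = """
<font class="testo_normale">
<i>Author A</i><br>Editorial<br>
<a href="fulltext.php?x=1">Full text</a><hr>
<i>Author B</i><br>Some title<br>
<a href="show_abstract.php?varpdf=DWT_abstracts/vol_5/5_2009_1.pdf">Abstract</a><hr>
</font>
"""


class AbstractCollector(HTMLParser):
    """Collect one abstract URL (or a placeholder) per <hr>-terminated article."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.current = None  # abstract URL seen since the previous <hr>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(PREFIX):
                self.current = BASE_URL + href[len(PREFIX):]
        elif tag == "hr":
            # End of an article: record what was found, then reset
            self.results.append(self.current or "No Abstract Link")
            self.current = None


collector = AbstractCollector()
collector.feed(SAMPLE)
print(collector.results)
```

With the sample markup, the first article has no abstract link and gets the placeholder, while the second yields its PDF URL.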