How to make Python detect if there is no data while web scraping?

Time:01-30

I’m extracting all the abstract links from this page: https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009

Then I save them in a list using this code:

from bs4 import BeautifulSoup
import requests
    
abstract_list = []

r = requests.get('https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.find_all('a', class_='testo_normale_rosso'):
    if "show_abstract.php?varpdf=" in link.get('href'):
        # TO ADD
        baseurl = 'https://www.deswater.com/'
        links = baseurl + link.attrs['href'].replace('show_abstract.php?varpdf=', '')
        abstract_list.append(links)

print(abstract_list)

In the end I get the list, but there is one issue: if you check the page, the first entry to be extracted doesn't have an abstract link.

Hence, I want the code to detect that there is no abstract link and append 'No Abstract Link' to the list in its place.

Current output:

['https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf', ...]

What I want:

['No Abstract Link', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf', ...]

Appreciate your help.

CodePudding user response:

The HTML markup of the page is malformed, so try the html5lib or lxml parser instead of html.parser:

import requests
from bs4 import BeautifulSoup

url = "https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009"

soup = BeautifulSoup(requests.get(url).content, "lxml")

out = []
for p in soup.select("font:has(hr)"):
    link = p.find(
        lambda tag: tag.name == "a" and "Abstract" in tag.text, recursive=False
    )
    out.append(
        "https://www.deswater.com/" + link["href"].split("=")[-1] if link else ""
    )

print(out)

Prints:

[
    "",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_12.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_19.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_29.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_34.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_42.pdf",
    "https://www.deswater.com/DWT_abstracts/vol_5/5_2009_48.pdf",

...and so on.
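
To try the per-article idea without requesting the live site, here is a minimal sketch on an inline snippet. The HTML is made up to imitate the assumed layout of the page (one font block per article, closed by an hr), and BASE_URL and NO_ABSTRACT are illustrative names. It uses the built-in html.parser only because this sample markup is well formed; for the real page the parser advice above still applies.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.deswater.com/"
NO_ABSTRACT = "No Abstract Link"  # placeholder text requested in the question

# Made-up sample that imitates the assumed page layout, not the real markup.
html = """
<font class="testo_normale"><i>Author A</i><br>Editorial<br><hr></font>
<font class="testo_normale"><i>Author B</i><br>Some title<br>
<a href="show_abstract.php?varpdf=DWT_abstracts/vol_5/5_2009_1.pdf" class="testo_normale_rosso">Abstract</a><br><hr></font>
"""

soup = BeautifulSoup(html, "html.parser")

out = []
for block in soup.select("font:has(hr)"):
    # Look only at direct children of the block for the "Abstract" link.
    link = block.find(
        lambda tag: tag.name == "a" and "Abstract" in tag.text, recursive=False
    )
    # One entry per article: the PDF URL, or the placeholder if no link exists.
    out.append(BASE_URL + link["href"].split("=")[-1] if link else NO_ABSTRACT)

print(out)
```

Because the loop runs once per article block rather than once per link, the list keeps its alignment with the articles even when an abstract is missing.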

CodePudding user response:

This seems to be nearly the same issue as in https://stackoverflow.com/a/75201360/14460824; you have to adapt or generalize that approach.

for e in soup.select('.testo_normale hr'):

    fulltext = 'No Fulltext'
    abstract = 'No Abstract'
    
    for tag in e.find_previous_siblings():
        if tag.name != 'i':
            if tag.get('href') and tag.get('href').startswith('show_abstract'):
                abstract = base_url + tag.get('href')

            if tag.get('href') and tag.get('href').startswith('fulltext'):
                fulltext = base_url + tag.get('href')
        else:
            break

Example

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.deswater.com/'
soup = BeautifulSoup(requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009').content)


data = []

for e in soup.select('.testo_normale hr'):

    fulltext = 'No Fulltext'
    abstract = 'No Abstract'
    
    for tag in e.find_previous_siblings():
        if tag.name != 'i':
            if tag.get('href') and tag.get('href').startswith('show_abstract'):
                abstract = base_url + tag.get('href')

            if tag.get('href') and tag.get('href').startswith('fulltext'):
                fulltext = base_url + tag.get('href')
        else:
            break

    data.append({
        'author': e.find_previous('i').text,
        'fulltext': fulltext,
        'abstract': abstract
    })

data

Output

[{'author': 'Miriam Balaban',
  'fulltext': 'No Fulltext',
  'abstract': 'https://www.deswater.com/show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_vii.pdf'},
 {'author': 'W. Richard Bowen',
  'fulltext': 'https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzEucGRm&desc=k@1@kfontk@13@kfacek@7@kk@30@kGenevak@6@kk@13@kArialk@6@kk@13@kHelveticak@6@kk@13@ksank@35@kserifk@30@kk@13@ksizek@7@kk@30@k2k@30@kk@2@kk@1@kik@[email protected]@13@kRichardk@13@kBowenk@1@kk@4@kik@2@kk@1@kbrk@2@kWaterk@13@kengineeringk@13@kfork@13@kthek@13@kpromotionk@13@kofk@13@kpeacek@1@kbrk@2@k1k@15@k2009k@16@k1k@35@k6k@1@kbrk@4@kk@2@kk@1@kak@13@khrefk@7@kDWTk@12@kabstractsk@4@kvolk@12@k1k@4@k1k@12@k2009k@[email protected]@13@kclassk@7@kk@5@kk@30@ktestok@12@knormalek@12@krossok@5@kk@30@kk@13@ktargetk@7@kk@5@kk@30@kk@12@kblankk@5@kk@30@kk@2@kAbstractk@1@kk@4@kak@2@kk@1@kbrk@2@k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2NC9URFdUX0FfMTA1MTI4NjRfTy5wZGY=&type=1',
  'abstract': 'https://www.deswater.com/show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_1.pdf'},
 {'author': 'Steven J. Duranceau',
  'fulltext': 'https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzcucGRm&desc=k@1@kfontk@13@kfacek@7@kk@30@kGenevak@6@kk@13@kArialk@6@kk@13@kHelveticak@6@kk@13@ksank@35@kserifk@30@kk@13@ksizek@7@kk@30@k2k@30@kk@2@kk@1@kik@2@kStevenk@[email protected]@13@kDuranceauk@1@kk@4@kik@2@kk@1@kbrk@2@kModelingk@13@kthek@13@kpermeatek@13@ktransientk@13@kresponsek@13@ktok@13@kperturbationsk@13@kfromk@13@ksteadyk@13@kstatek@13@kink@13@kak@13@knanofiltrationk@13@kprocessk@1@kbrk@2@k1k@15@k2009k@16@k7k@35@k16k@1@kbrk@4@kk@2@kk@1@kak@13@khrefk@7@kDWTk@12@kabstractsk@4@kvolk@12@k1k@4@k1k@12@k2009k@[email protected]@13@kclassk@7@kk@5@kk@30@ktestok@12@knormalek@12@krossok@5@kk@30@kk@13@ktargetk@7@kk@5@kk@30@kk@12@kblankk@5@kk@30@kk@2@kAbstractk@1@kk@4@kak@2@kk@1@kbrk@2@k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2NS9URFdUX0FfMTA1MTI4NjVfTy5wZGY=&type=1',
  'abstract': 'https://www.deswater.com/show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_7.pdf'},...]
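
The abstract values above still point at show_abstract.php, while the question lists the PDF paths directly. If that direct form is what is wanted, the varpdf parameter can be pulled out with the standard library. This is only a sketch: direct_pdf_url is a hypothetical helper name, and it assumes the value of varpdf is a path relative to the site root, as the output in the question suggests.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://www.deswater.com/"


def direct_pdf_url(href, base_url=BASE_URL):
    """Return the URL named by the varpdf parameter, or None if it is absent."""
    values = parse_qs(urlsplit(href).query).get("varpdf")
    return urljoin(base_url, values[0]) if values else None


print(direct_pdf_url("show_abstract.php?varpdf=DWT_abstracts/vol_1/1_2009_1.pdf"))
print(direct_pdf_url("fulltext.php?type=1"))
```

Returning None for links without varpdf leaves the choice of placeholder ('No Abstract' here) to the calling loop.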

CodePudding user response:

Not sure I understand your question correctly (it seems too easy) - nonetheless, here is one way to do it:

from bs4 import BeautifulSoup as bs
import requests
    
abstract_list = []

r = requests.get('https://www.deswater.com/vol.php?vol=5&oth=5|1-3|May|2009')
# Parsing the HTML
soup = bs(r.text, 'html.parser')

for link in soup.find_all('a', class_='testo_normale_rosso'):
    if "show_abstract.php?varpdf=" in link.get('href'):
        # TO ADD
        baseurl = 'https://www.deswater.com/'
        links = baseurl + link.attrs['href'].replace('show_abstract.php?varpdf=', '')
        
    else:
        links='Nothing, nada, zilch'
    abstract_list.append(links)

print(abstract_list)

Result in terminal:

['Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_1.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_6.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_12.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_19.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_29.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_34.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_42.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_48.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_54.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_59.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_68.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_74.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_80.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_91.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_99.pdf', 'Nothing, 
nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_106.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_111.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_119.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_124.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_132.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_137.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_146.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_153.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_159.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_167.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_172.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_178.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_183.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_192.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 
'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_198.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_207.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_213.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_223.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_235.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_252.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_257.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_267.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_275.pdf', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'Nothing, nada, zilch', 'https://www.deswater.com/DWT_abstracts/vol_5/5_2009_283.pdf', 'Nothing, nada, zilch']

Of course, you can append what you want to that list, when there's no link match.
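
Note that the loop above adds one entry per link, which is why the placeholder shows up several times for each article. If the hrefs are first grouped per article, a single entry per article can be picked with next() and a default value. A small sketch on made-up data; pick_abstract and the sample hrefs are hypothetical, and how the links get grouped depends on the page structure:

```python
BASE_URL = "https://www.deswater.com/"
PREFIX = "show_abstract.php?varpdf="


def pick_abstract(hrefs, placeholder="No Abstract Link"):
    """Return the first abstract URL among hrefs, or the placeholder."""
    matches = (BASE_URL + h.replace(PREFIX, "") for h in hrefs if h and PREFIX in h)
    return next(matches, placeholder)


# Made-up hrefs, already grouped so that each inner list is one article.
articles = [
    ["fulltext.php?type=1"],
    ["fulltext.php?type=1", "show_abstract.php?varpdf=DWT_abstracts/vol_5/5_2009_1.pdf"],
]

abstract_list = [pick_abstract(hrefs) for hrefs in articles]
print(abstract_list)
```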
