Method to Scrape a python list of multiple URLs

Time:11-12

I scraped Wikipedia pages for the URLs I need and appended them to an empty list in Python. I now need to scrape every URL in my list for specific information, such as date, coordinates, etc.

Given the parent/sub-parent structure of the HTML, a lot of the information cannot be located by tag alone. Or can it? See the fact box at the following link: https://en.wikipedia.org/wiki/1987_Maryland_train_collision. I am targeting these fact boxes because most of the pages include one.

I understand that you can use a conditional statement to pick out specific data from a set of elements that share the same HTML tag. However, I am not sure how to approach it.

So far I have the below:

urls_collected = #my list of urls to be scraped


for url in urls_collected:
        
        soup = BeautifulSoup(text, features="lxml")
        
        for item in soup.findAll('td',attrs={'class':'infobox-label'}):
            
            if item.find('td', attrs={'class':'infobox-data'})  == "date":
                print(item.find)
    
                date_info = item.get("infobox-data")
                print(date_info)

                #do something more..                          

The code doesn't run as expected. If I print(item.find) and print(date_info) I get:

<bound method Tag.find of <div class="mw-content-ltr" dir="ltr" lang="en"><h3>S</h3>
<ul><li><a href=.....

and

None 

Thank you for your time.

CodePudding user response:

The structure of what you're examining looks like this:

<tr>
  <th scope="row" class="infobox-label" style="white-space:nowrap;padding-right:0.65em;">Date</th>
  <td class="infobox-data" style="line-height:1.3em;">January 4, 1987 <br>1:30 PM</td>
</tr>

  • Note that the "infobox-label" is in a TH tag, not a TD tag.
  • item.find is a method; you probably intended "print(item)".
  • Once you've found the TH tag, you will want to move to the TD tag to get the value. There are several ways to do this; I think the simplest is to reference "item.parent.td".

Maybe you're looking for something like this:

    for item in soup.findAll('th',attrs={'class':'infobox-label'}):
        
        if item.text  == "Date":
            print(item)

            date_info = item.parent.td.text
            print(date_info)

Alternatively just:

soup.select_one('.infobox').find('th', text="Date").parent.td.text.strip()
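
If you need more than one field per page, the same TH-to-TD step can be applied to every labelled row at once. The following is only a minimal sketch: the helper names infobox_fields and scrape_all and the fetch parameter are made up for illustration, and it assumes each page's fact box is a table with class "infobox" laid out like the snippet above.

```python
from bs4 import BeautifulSoup


def infobox_fields(html):
    """Return {label: value} for every labelled row of the first infobox."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.select_one(".infobox")
    if infobox is None:  # not every page has a fact box
        return {}
    fields = {}
    for label in infobox.find_all("th", class_="infobox-label"):
        # The value cell sits in the same row (<tr>) as the label cell.
        value = label.parent.find("td", class_="infobox-data")
        if value is not None:
            key = label.get_text(" ", strip=True)
            fields[key] = value.get_text(" ", strip=True)
    return fields


def scrape_all(urls, fetch):
    """Apply infobox_fields to each URL; fetch(url) must return page HTML."""
    return {url: infobox_fields(fetch(url)) for url in urls}


# Inline snippet used instead of a live request (hypothetical stand-in page).
sample = """
<table class="infobox">
<tr><th scope="row" class="infobox-label">Date</th>
<td class="infobox-data">January 4, 1987 <br>1:30 PM</td></tr>
</table>
"""
result = scrape_all(["sample"], fetch=lambda url: sample)
print(result["sample"].get("Date"))
```

For real pages you would pass something like fetch=lambda url: requests.get(url, timeout=10).text (assuming the requests library is installed) and your urls_collected list. Looking fields up with .get("Date") avoids an error on pages where a row is missing.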

CodePudding user response:

A quick follow-up: the solution given by Rusticus works for everything but Coordinates. I get the following:

 .mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}50°54′46″N 0°09′14″W / 50.91278°N 0.15389°W / 50.91278; -0.15389Coordinates: 50°54′46″N 0°09′14″W / 50.91278°N 0.15389°W / 50.91278; -0.15389  

I cannot discern from the HTML what is different here.
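
The CSS rules at the start of that output suggest the Coordinates cell contains an embedded <style> element, whose contents .text picked up along with the visible text (whether this happens may depend on the Beautiful Soup version). A hedged sketch of one way around it: drop any <style> elements before reading the text, or read only the span that holds the decimal pair. The HTML below is a simplified, hypothetical stand-in for the real row, and cell_text is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Simplified, assumed stand-in for a Coordinates row; the real markup is longer.
sample = """
<table class="infobox"><tr>
<th scope="row" class="infobox-label">Coordinates</th>
<td class="infobox-data"><style>.mw-parser-output .geo-dec{display:inline}</style>
<span class="geo-dms">50°54′46″N 0°09′14″W</span>
<span class="geo-dec">50.91278°N 0.15389°W</span>
<span class="geo">50.91278; -0.15389</span></td>
</tr></table>
"""


def cell_text(cell):
    """Text of a cell after removing any embedded <style> elements."""
    for style in cell.find_all("style"):
        style.decompose()
    return cell.get_text(" ", strip=True)


soup = BeautifulSoup(sample, "html.parser")
label = soup.find("th", string="Coordinates")
cell = label.parent.td

# Option 1: the whole cell, minus the stylesheet text.
print(cell_text(cell))

# Option 2: only the "lat; lon" span, if the page has one.
decimal = cell.select_one(".geo")
lat_lon = decimal.get_text(strip=True) if decimal else None
if lat_lon:
    lat, lon = (float(part) for part in lat_lon.split(";"))
    print(lat, lon)
```

Option 2 is usually easier to work with afterwards, since it yields two floats rather than a string mixing several coordinate formats.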