I scraped Wikipedia pages for the URLs I need and appended them to an empty list in Python. I now need to scrape every URL in my list for specific information, like date, coordinates etc.
Given the nested parent/child structure of the HTML, a lot of the information cannot be selected by tag alone. Or can it? See the fact box on the following page: https://en.wikipedia.org/wiki/1987_Maryland_train_collision. I am targeting these fact boxes because most of the pages include one.
I understand that you can use a conditional statement to pick out specific data from a set of elements that share the same HTML tag. However, I am not sure how to approach it.
So far I have the following:
urls_collected = #my list of urls to be scraped
for url in urls_collected:
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll('td',attrs={'class':'infobox-label'}):
        if item.find('td', attrs={'class':'infobox-data'}) == "date":
            print(item.find)
            date_info = item.get("infobox-data")
            print(date_info)
            #do something more..
The code doesn't run as expected. If I print(item.find) and print(date_info), I get:
<bound method Tag.find of <div class="mw-content-ltr" dir="ltr" lang="en"><h3>S</h3>
<ul><li><a href=.....
&
None
Thank you for your time.
CodePudding user response:
The structure of what you're examining looks like this:
<tr>
  <th scope="row" class="infobox-label" style="white-space:nowrap;padding-right:0.65em;">Date</th>
  <td class="infobox-data" style="line-height:1.3em;">January 4, 1987 <br>1:30 PM</td>
</tr>
- Note that the "infobox-label" class is on a TH tag, not a TD tag.
- item.find is a method; you probably intended "print(item)".
- Once you've found the TH tag, you will want to move to the TD tag to get the value. There are several ways to do this; I think the simplest is to reference "item.parent.td".
Maybe you're looking for something like this:
for item in soup.findAll('th',attrs={'class':'infobox-label'}):
    if item.text == "Date":
        print(item)
        date_info = item.parent.td.text
        print(date_info)
Alternatively just:
soup.select_one('.infobox').find('th', text="Date").parent.td.text.strip()
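If you need several fields per page, the same TH/TD pairing can be collected into a dictionary in one pass. This is only a minimal sketch under some assumptions: `parse_infobox` and `SAMPLE_HTML` are names made up for illustration, the Date row is copied from the snippet above while the second row is invented, and it uses the built-in "html.parser" backend instead of "lxml" purely so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the Date row from the answer above wrapped in an
# infobox table. The "Location" row is a made-up second row for the demo.
SAMPLE_HTML = """
<table class="infobox">
  <tr>
    <th scope="row" class="infobox-label">Date</th>
    <td class="infobox-data">January 4, 1987 <br>1:30 PM</td>
  </tr>
  <tr>
    <th scope="row" class="infobox-label">Location</th>
    <td class="infobox-data">Example location</td>
  </tr>
</table>
"""


def parse_infobox(html):
    """Return {label: value} for each labelled row of the first infobox."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.select_one(".infobox")
    if infobox is None:
        # Some pages have no fact box at all.
        return {}
    fields = {}
    for label in infobox.find_all("th", class_="infobox-label"):
        value = label.find_next_sibling("td")
        if value is not None:
            # The " " separator keeps the text on either side of <br> apart.
            fields[label.get_text(strip=True)] = value.get_text(" ", strip=True)
    return fields


info = parse_infobox(SAMPLE_HTML)
print(info.get("Date"))
```

Using `info.get("Date")` rather than `info["Date"]` gives `None` instead of an exception on pages where a field is missing, which is probably what you want when looping over many URLs.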
CodePudding user response:
A quick follow-up. The solution given by Rusticus works for everything except Coordinates, where I get the following:
mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}50°54′46″N 0°09′14″W / 50.91278°N 0.15389°W / 50.91278; -0.15389Coordinates: 50°54′46″N 0°09′14″W / 50.91278°N 0.15389°W / 50.91278; -0.15389
I cannot tell from the HTML what is different here.
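The text at the start of that output looks like CSS rules, which suggests the Coordinates cell contains a `<style>` element next to the visible spans and that `.text` is returning its contents as well (whether it does may depend on the Beautiful Soup version and parser in use). Below is a minimal sketch of one possible workaround. The `SAMPLE_CELL` markup is a simplified guess reconstructed from the output above, not the page's exact HTML, and `clean_cell_text` is a hypothetical helper name. It drops any `<style>` tags before reading the text, and separately picks just the decimal form through the `geo-dec` class that appears in the CSS.

```python
from bs4 import BeautifulSoup

# Simplified guess at the cell's shape, based on the output shown above.
SAMPLE_CELL = """
<table><tr>
  <th class="infobox-label">Coordinates</th>
  <td class="infobox-data">
    <style>.mw-parser-output .geo-dec{display:inline}</style>
    <span class="geo-dms">50°54′46″N 0°09′14″W</span>
    <span class="geo-dec">50.91278°N 0.15389°W</span>
  </td>
</tr></table>
"""


def clean_cell_text(cell):
    """Remove <style> tags from the cell (in place), then return its text."""
    for style in cell.find_all("style"):
        style.decompose()
    return cell.get_text(" ", strip=True)


soup = BeautifulSoup(SAMPLE_CELL, "html.parser")
cell = soup.find("td", class_="infobox-data")

# Option 1: all visible text of the cell, without the CSS.
all_text = clean_cell_text(cell)

# Option 2: only the decimal coordinates, if that span exists.
decimal = cell.select_one(".geo-dec")
decimal_text = decimal.get_text(strip=True) if decimal else None

print(all_text)
print(decimal_text)
```

Selecting one specific span is likely the more robust of the two, since the real cell may repeat the coordinates in more than one format.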