I'm having trouble extracting one specific piece of text from a table: in this example, the date of page creation shown on a Wikipedia page's information page. For example, at this link:
https://en.wikipedia.org/wiki/United_States?action=info
I'm using BeautifulSoup, but the result still includes the rest of the surrounding text. I need only the date of page creation.
CodePudding user response:
Grep'ing returns a single line of HTML:
$ curl -s 'https://en.wikipedia.org/wiki/United_States?action=info' |
grep --color 'Date of page creation'
In this case iterating line by line and using a regex would suffice.
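A minimal sketch of that line-by-line regex idea, using only the standard library. The function name creation_date_from_lines and the sample line are made up for illustration, and the sketch assumes the label and the value sit in adjacent <td> cells on one line, as the grep output suggests:

```python
import re

LABEL = "Date of page creation"

def creation_date_from_lines(lines):
    """Return the text of the cell that follows the label cell, or None."""
    for line in lines:
        if LABEL not in line:
            continue
        # Grab the inner HTML of every <td> on this line, then drop nested tags.
        cells = [re.sub(r"<[^>]+>", "", cell).strip()
                 for cell in re.findall(r"<td[^>]*>(.*?)</td>", line)]
        if LABEL in cells:
            idx = cells.index(LABEL)
            if idx + 1 < len(cells):
                return cells[idx + 1]
    return None

# Hypothetical sample line with a placeholder timestamp, not real page data.
sample = ('<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td>'
          '<td><a href="#">12:00, 1 January 2001</a></td></tr>')
print(creation_date_from_lines([sample]))
```

This depends on the whole row staying on a single line, which is why a real HTML parser is usually the safer choice.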
But let's stick with BS4, a good tool for the job.
1. iterate
Just loop over the soup.find_all('td') tags until you find one having td.text matching "Date of page creation". Then ask for the next tag, and its td.text has the timestamp you want.
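The loop described in option 1 might look roughly like this. It is a sketch, not the only way to do it: the function name creation_date_by_iteration is hypothetical, it takes the page HTML as a string so it can be tried without a network call, and it assumes the value cell is the next <td> sibling of the label cell.

```python
from bs4 import BeautifulSoup

def creation_date_by_iteration(html):
    """Scan every <td>; return the text of the cell after the label cell."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == "Date of page creation":
            # The timestamp is assumed to be in the next <td> of the same row.
            value = td.find_next_sibling("td")
            return value.get_text(strip=True) if value else None
    return None
```

To use it on the live page, pass in the downloaded HTML, for example requests.get(url).text.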
2. search for tag
Take advantage of the "mw-pageinfo-firsttime" id on the <tr> row, telling BS4 to look for that. Read and discard a <td>. Read another datum and return its td.text timestamp.
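Option 2 could be sketched as follows. The function name creation_date_by_id is hypothetical, and the code assumes the row holds exactly a label cell followed by a value cell; it returns None instead of raising if the row or a cell is missing.

```python
from bs4 import BeautifulSoup

def creation_date_by_id(html):
    """Find the row by its id, skip the label cell, return the value text."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find("tr", id="mw-pageinfo-firsttime")
    if row is None:
        return None
    label_cell = row.find("td")            # first cell: the label, discarded
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling("td")  # second cell: the date
    return value_cell.get_text(strip=True) if value_cell else None
```

Looking the row up by id avoids depending on the label text, which would differ on non-English wikis.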
CodePudding user response:
The page contains several tables, but only one of them holds the date and time information. Fortunately, the date row has a unique id, which makes the job easy.
So find the tr by its id and read its content with the .text property (which also includes the label), or first get the second cell of that row and then read its content.
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/United_States?action=info'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')

# Each of these rows has two cells: the label, then the value (index 1).
first_time = soup.find(id='mw-pageinfo-firsttime').find_all('td')[1]
last_time = soup.find(id='mw-pageinfo-lasttime').find_all('td')[1]

print(first_time.text)  # date of page creation
print(last_time.text)   # date of latest edit