How to get Date of page creation on wikipedia page using python?

Time:06-25

I have a problem with how to get a specific piece of text from a table, in this case the date of page creation on a Wikipedia page. For example, at this link:

https://en.wikipedia.org/wiki/United_States?action=info

I'm using BeautifulSoup, but I'm still having trouble because the rest of the text comes along with it. I need only the date of page creation.

CodePudding user response:

Grep'ing returns a single line of HTML:

$ curl -s 'https://en.wikipedia.org/wiki/United_States?action=info' |
    grep --color 'Date of page creation' 

In this case iterating line by line and using a regex would suffice.
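
As a rough sketch of that line-by-line idea in Python: the snippet below scans for the label, then strips the HTML tags from whatever follows it on the same line. It assumes the label and timestamp sit on one line, as the grep output suggests. SAMPLE_HTML is a made-up, simplified stand-in for the downloaded page, its timestamp is a placeholder, and creation_date_from_lines is a hypothetical helper name.

```python
import re

# Hypothetical, simplified stand-in for the downloaded page source.
# The timestamp is a placeholder, not the real value.
SAMPLE_HTML = """<table>
<tr><td>Page length (in bytes)</td><td>0</td></tr>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="#">12:00, 1 January 2001</a></td></tr>
</table>"""

LABEL = "Date of page creation"
TAG = re.compile(r"<[^>]+>")  # matches any single HTML tag


def creation_date_from_lines(html):
    """Return the text that follows the label on its line, or None."""
    for line in html.splitlines():
        if LABEL in line:
            # Keep only what comes after the label, then drop the tags.
            rest = line.split(LABEL, 1)[1]
            return TAG.sub("", rest).strip()
    return None


print(creation_date_from_lines(SAMPLE_HTML))
```

This relies on the page's line layout staying the same, which is one reason an HTML parser is the sturdier choice.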

But let's stick with BS4, a good tool for the job.

1. iterate

Just loop over the soup.find_all('td') tags until you find one whose td.text matches "Date of page creation". Then move on to the next <td>; its td.text holds the timestamp you want.

2. search for tag

Take advantage of the "mw-pageinfo-firsttime" id on the <tr> row by telling BS4 to look for it. Skip the first <td> (the label), then read the second <td> and return its td.text timestamp.
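
Both approaches could be sketched roughly as follows. To keep the example offline, it parses a made-up, simplified fragment (SAMPLE_HTML, with a placeholder timestamp) instead of the live page. The helper names by_label and by_row_id are hypothetical; only the "mw-pageinfo-firsttime" id comes from the page itself.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment shaped like the info table.
# The timestamp is a placeholder, not the real value.
SAMPLE_HTML = """
<table>
<tr><td>Page length (in bytes)</td><td>0</td></tr>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="#">12:00, 1 January 2001</a></td></tr>
</table>
"""


def by_label(soup, label):
    """Approach 1: scan every <td> until one matches the label."""
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == label:
            # The value lives in the next <td> of the same row.
            value_cell = td.find_next_sibling("td")
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None


def by_row_id(soup, row_id):
    """Approach 2: find the <tr> by id, skip the label cell, read the value."""
    row = soup.find("tr", id=row_id)
    if row is None:
        return None
    cells = row.find_all("td")
    return cells[1].get_text(strip=True) if len(cells) > 1 else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(by_label(soup, "Date of page creation"))
print(by_row_id(soup, "mw-pageinfo-firsttime"))
```

Both helpers return None instead of raising when the cell or row is missing, which may be convenient if the page layout ever changes.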

CodePudding user response:

The page contains several tables, but only one of them holds the date and time information. Fortunately, the row with the date has a unique id, which makes the job easy. Find the tr by its id, then either read its content with the .text property (which also includes the label) or first get the second cell of the row and read that cell's content.

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/United_States?action=info'
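response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')

# Each of these rows has two cells: the label, then the value.
first_time = soup.find(id='mw-pageinfo-firsttime').find_all('td')[1]
last_time = soup.find(id='mw-pageinfo-lasttime').find_all('td')[1]

print(first_time.text)  # date of page creation
print(last_time.text)   # date of latest edit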