How can I extract all sections of a Wikipedia page in plain text?


I have the following Python code, which extracts only the introduction of the article on "Artificial intelligence", whereas I would like to extract all of its sub-sections (History, Goals, ...):

import requests

def get_wikipedia_page(page_title):
  endpoint = "https://en.wikipedia.org/w/api.php"
  params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "",
    "explaintext": "",
    "titles": page_title
  }
  response = requests.get(endpoint, params=params)
  data = response.json()
  pages = data["query"]["pages"]
  page_id = list(pages.keys())[0]
  return pages[page_id]["extract"]

page_title = "Artificial intelligence"
wikipedia_page = get_wikipedia_page(page_title)

Someone proposed another approach that parses the HTML and uses BeautifulSoup to convert it to text:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

This is not a good enough solution, as it includes all the text that appears on the page (such as image captions), and it keeps the citation markers (e.g. [1]) in the text, whereas the first script removes them.
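If one stays with the BeautifulSoup route, the citation markers can at least be stripped by removing the reference <sup> elements before extracting the text. A rough sketch, assuming the standard Wikipedia markup where citations are rendered as <sup class="reference"> and the article body lives in <div id="mw-content-text">:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
soup = BeautifulSoup(urlopen(url).read(), features="html.parser")

# Citations are rendered as <sup class="reference">[1]</sup>; removing these
# elements drops the [1]-style markers before the text is extracted.
for sup in soup.find_all("sup", class_="reference"):
    sup.decompose()

# Restricting extraction to the article body and its paragraphs skips most of
# the page chrome, though captions and infobox text may still slip through.
body = soup.find("div", id="mw-content-text")
text = "\n".join(p.get_text().strip() for p in body.find_all("p"))
print(text)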

Still, I suspect that the Wikipedia API offers a more elegant solution; it would be rather odd if one could only get the first section.

CodePudding user response:

See Wikipedia's API documentation.

There is an API sandbox where you can try several parameters, e.g.:

  • action=parse
  • format=json
  • page=Pet door
  • prop=sections

To get all the sections of a given page: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Pet_door&prop=sections&formatversion=2

This responds with JSON for the requested page, listing all of its sections:

{
    "parse": {
        "title": "Pet door",
        "pageid": 3276454,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Purpose",
                "number": "1",
                "index": "1",
                "fromtitle": "Pet_door",
                "byteoffset": 831,
                "anchor": "Purpose",
                "linkAnchor": "Purpose"
            }
        ]
    }
}

(simplified, only first section kept)
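The same call can be made from Python with requests. A minimal sketch (the helper name get_sections is just illustrative):

import requests

ENDPOINT = "https://en.wikipedia.org/w/api.php"

def get_sections(page_title):
    """List all sections of a page via action=parse & prop=sections."""
    params = {
        "action": "parse",
        "format": "json",
        "formatversion": "2",
        "page": page_title,
        "prop": "sections"
    }
    response = requests.get(ENDPOINT, params=params)
    return response.json()["parse"]["sections"]

for section in get_sections("Pet door"):
    print(section["index"], section["line"])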

To get the text of one of those sections, add section=1&sectiontitle=Purpose: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Pet_door&section=1&sectiontitle=Purpose&formatversion=2

This retrieves the text (in HTML format):

"text": "<div class=\"mw-parser-output\"><h2><span class=\"mw-headline\" id=\"Purpose\">Purpose</span><span class=\"mw-editsection\"><span class=\"mw-editsection-bracket\">[</span><a href=\"/w/index.php?title=Pet_door&amp;action=edit&amp;section=1\" title=\"Edit section: Purpose\">edit</a><span class=\"mw-editsection-bracket\">]</span></span></h2>\n<p>A pet door is found to be convenient by many owners of companion animals, especially dogs and cats, because it lets the pets come and go as they please, reducing the need for pet-owners to let or take the pet outside manually, and curtailing unwanted behaviour such as loud vocalisation to be let outside, scratching on doors or walls, and (especially in the case of dogs) <a href=\"/wiki/Excretion\" title=\"Excretion\">excreting</a> in the house. They also help to ensure that a pet left outdoors can safely get back into the house unattended, in the case of inclement weather.\n</p>\n<!-- \nNewPP limit report\nParsed by mw1386\nCached time: 20221217200829\nCache expiry: 1814400\nReduced expiry: false\nComplications: []\nCPU time usage: 0.003 seconds\nReal time usage: 0.005 seconds\nPreprocessor visited node count: 2/1000000\nPost‐expand include size: 0/2097152 bytes\nTemplate argument size: 0/2097152 bytes\nHighest expansion depth: 2/100\nExpensive parser function count: 0/500\nUnstrip recursion depth: 0/20\nUnstrip post‐expand size: 0/5000000 bytes\nNumber of Wikibase entities loaded: 0/400\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00%    0.000      1 -total\n-->\n</div>",

CodePudding user response:

The problem is the field you query.

Here:

  {'1164': {'pageid': 1164, 'ns': 0, 'title': 'Artificial intelligence', 'extract': 'Artificial intelligence (AI) is intel... 

You need to change return pages[page_id]["extract"] to return pages[page_id]["title"].
