I am trying to scrape the links in a subsection of a Wikipedia page using Python. For example, on https://en.wikipedia.org/wiki/Lists_of_video_games I want to get all the links under the "By genre" section only.
I have tried BeautifulSoup, but I get too much information; I need a way to limit the result to that section only.
It would be even better if I could also get the subsection titles, so that I have, for example, all the links under "Action", all the links under "Sports", etc.
Any help or guidance would be appreciated.
Thanks,
CodePudding user response:
Hopefully the following example gives you the output you are after:
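import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Lists_of_video_games'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')

# Select the table-of-contents entries nested under the "By genre" item.
# tocsection-41 is a positional class, so it may change if the page is edited.
links = soup.select('.toclevel-1.tocsection-41 > ul > li > a')
for link in links:
    # Each href is a fragment such as "#Action"; prepend the page URL.
    href = url + link.get('href')
    print(href)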
Output:
https://en.wikipedia.org/wiki/Lists_of_video_games#Action
https://en.wikipedia.org/wiki/Lists_of_video_games#Casual_and_puzzle
https://en.wikipedia.org/wiki/Lists_of_video_games#Role-playing
https://en.wikipedia.org/wiki/Lists_of_video_games#Simulation
https://en.wikipedia.org/wiki/Lists_of_video_games#Sports
https://en.wikipedia.org/wiki/Lists_of_video_games#Strategy
https://en.wikipedia.org/wiki/Lists_of_video_games#Other
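
Note that these are only the table-of-contents anchors for each genre, not the links listed under each genre heading. If you also want the links grouped by subsection title, one possible approach is to walk the page in document order, switch "on" when the "By genre" heading is seen, and file every article link under the most recent subsection heading until the next top-level heading. The following is only a rough sketch of that idea: it uses the standard library's html.parser so it runs without a network call, the SAMPLE_HTML string and its /wiki/Example_... links are made up, and the assumption that sections are h2 headings with h3 subsections is mine, so the real page's markup may need adjustments. SectionLinkParser and links_by_subsection are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Lists_of_video_games"

# Simplified, made-up HTML that mimics a heading/list layout.
# The real page's markup may differ; these links are placeholders.
SAMPLE_HTML = """
<h2>By platform</h2>
<ul><li><a href="/wiki/Example_platform_list">Example platform list</a></li></ul>
<h2>By genre</h2>
<h3>Action</h3>
<ul>
  <li><a href="/wiki/Example_action_list_1">Example action list 1</a></li>
  <li><a href="/wiki/Example_action_list_2">Example action list 2</a></li>
</ul>
<h3>Sports</h3>
<ul><li><a href="/wiki/Example_sports_list">Example sports list</a></li></ul>
<h2>By developer</h2>
<ul><li><a href="/wiki/Example_developer_list">Example developer list</a></li></ul>
"""


class SectionLinkParser(HTMLParser):
    """Collect /wiki/ links under one <h2> section, grouped by <h3> title."""

    def __init__(self, section_title):
        super().__init__()
        self.section_title = section_title
        self.links = {}            # subsection title -> list of absolute URLs
        self._heading_tag = None   # "h2" or "h3" while inside a heading
        self._heading_text = []    # text pieces of the heading being read
        self._in_section = False   # True between the wanted <h2> and the next <h2>
        self._subsection = None    # title of the most recent <h3> in the section

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._heading_tag = tag
            self._heading_text = []
        elif (tag == "a" and self._in_section and self._subsection
              and self._heading_tag is None):
            href = dict(attrs).get("href") or ""
            # Keep article links only; this skips fragments and edit links.
            if href.startswith("/wiki/"):
                self.links[self._subsection].append(urljoin(BASE_URL, href))

    def handle_data(self, data):
        if self._heading_tag:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if tag != self._heading_tag:
            return
        title = "".join(self._heading_text).strip()
        if tag == "h2":
            # A new top-level section starts: enter or leave the wanted one.
            self._in_section = (title == self.section_title)
            self._subsection = None
        elif self._in_section:
            self._subsection = title
            self.links.setdefault(title, [])
        self._heading_tag = None


def links_by_subsection(html, section_title):
    parser = SectionLinkParser(section_title)
    parser.feed(html)
    return parser.links


result = links_by_subsection(SAMPLE_HTML, "By genre")
for title, urls in result.items():
    print(title)
    for url in urls:
        print("  " + url)
```

To try it on the live page, you could pass requests.get(url).text in place of SAMPLE_HTML; whether the heading levels and link prefixes match is something you would need to check against the page source.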