I previously extracted some information from this webpage using BeautifulSoup4: https://www.peakbagger.com/list.aspx?lid=5651
This gave me a list of a tags with their href attributes:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.peakbagger.com/list.aspx?lid=5651'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
a = soup.select("a:nth-of-type(1)")
a
But I only want the links whose href starts with 'peak.aspx?pid=10...'
How do I print only the ones starting with 'peak.aspx?pid=10...'? Do I need to use a loop, or split the strings?
Thanks.
CodePudding user response:
An approach could be to loop over your selection and just pick the links that contain the string peak.aspx?pid=:
[x['href'] for x in soup.select('a[href]') if 'peak.aspx?pid=' in x['href']]
But you can also make your selector more specific to get the result. This gives you only the a tags in the second column of the table:
soup.select('table.gray tr td:nth-of-type(2) a')
To get the links you have to loop over the result:
[x['href'] for x in soup.select('table.gray tr td:nth-of-type(2) a')]
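Since the question asks for links that start with a given string, a CSS attribute prefix selector may be another option. The following is a minimal sketch rather than a tested solution for the live page: SAMPLE_HTML and the pid values in it are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="gray">
  <tr><td>1.</td><td><a href="peak.aspx?pid=10001">Peak A</a></td></tr>
  <tr><td>2.</td><td><a href="peak.aspx?pid=10002">Peak B</a></td></tr>
</table>
<a href="list.aspx?lid=5651">Back to list</a>
<a name="top">Anchor without an href</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [href^="..."] matches only <a> tags whose href starts with the given prefix,
# so tags without an href or with a different target are skipped.
peak_links = [a["href"] for a in soup.select('a[href^="peak.aspx?pid="]')]
print(peak_links)
```

With this selector the filtering happens in the CSS query itself, so no separate string check is needed in the loop. To use it on the real page, replace SAMPLE_HTML with the html object from urlopen(url).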