Is it possible to capture certain links from a webpage and put them in a regex to be ignored?

I want to count how many internal links a group of articles has, but I need to ignore some of them: the ones listed in a category tag.

Right now, I can count the internal links in each article:

import re

import requests
from bs4 import BeautifulSoup

links_bs4 = ['page1', 'page2']
data = []
pattern = re.compile("https://example.com/")
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')

    # Collect the hrefs in the article body that match the internal-link pattern
    body_text = soup.find('div', class_='article-body')
    link_temp = [link.get('href') for link in body_text.find_all('a', href=pattern)]

    data.append({'title': title.string, 'count links': len(link_temp)})
    links.extend(link_temp)

But I need to capture some links inside a tag:

categorys = []
categorys.append(soup.find('div', class_='category'))

And put those inside my pattern so they are ignored too. So my pattern would be something like:

pattern = re.compile("https://example.com/ and https://example.com/category_1 and https://example.com/category_2")

I know my example above is wrong. How do I achieve that?

So, on a page that has the following links: https://example.com/, https://example.com/category_1, https://example.com/category_2, https://example.com/page_1, https://example.com/page_2

I would catch only https://example.com/page_1 and https://example.com/page_2, and then count them.

CodePudding user response:

To filter the list of links in link_temp before adding it to data, you can just add this line of code:

filtered_links = [i for i in link_temp if not any(x in i for x in ignored_links)]

where ignored_links should be a list with all the words/links you don't want to catch.
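Building on the code from the question, here is a minimal sketch of how ignored_links could be collected from the category div and applied inside the loop. The names category_div and filtered_links are introduced here for illustration; the 'category' class name is taken from the question.

import re

import requests
from bs4 import BeautifulSoup

links_bs4 = ['page1', 'page2']
data = []
pattern = re.compile("https://example.com/")

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')

    # Collect the hrefs inside the category tag; these are the links to skip.
    category_div = soup.find('div', class_='category')
    ignored_links = [a.get('href') for a in category_div.find_all('a')] if category_div else []

    body_text = soup.find('div', class_='article-body')
    link_temp = [link.get('href') for link in body_text.find_all('a', href=pattern)]

    # Drop any article link that matches one of the category links.
    filtered_links = [i for i in link_temp if not any(x in i for x in ignored_links)]

    data.append({'title': title.string, 'count links': len(filtered_links)})

Note that any(x in i ...) does substring matching, so if a very general URL such as https://example.com/ ends up in ignored_links, it will match every internal link. If the category links are exact URLs, the stricter test i not in ignored_links avoids that.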
