Scraping news articles using Selenium Python

Time:11-20

I am learning to scrape news articles from the website https://tribune.com.pk/pakistan/archives. The first step is to scrape the link of every news article. The problem is that the <a> tag for each article contains two hrefs, but I want only the first one, which I am unable to get. I am attaching the HTML of that particular part. The code I have written returns both hrefs:

import time

from selenium.webdriver.common.by import By

# driver, start_time, Url and Category are assumed to be defined earlier in the script.
def Url_Extraction():
    category_name = driver.find_element(By.XPATH, '//*[@id="main-section"]/h1')
    cat = category_name.text  # Save category name in variable
    print(cat)
    # Selects every <a> at any depth inside the matching div
    news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")

    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)
        # Report progress and elapsed time after each link
        current_time = time.time() - start_time
        print(f'{len(Url)} urls extracted')
        print(f'{len(Category)} categories extracted')
        print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
              flush=True)

Moreover, I am able to paginate, but I can't get the full article by clicking the individual links given on the main page.
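One possible way to reach the full articles, sketched here under assumptions: rather than clicking each link (element references found on the listing page usually go stale once the browser navigates away), collect the href strings first and open each one with driver.get. The function name scrape_articles and the paragraph locator are hypothetical; the right locator depends on the article page's markup, which is not shown in the post.

```python
def scrape_articles(driver, urls, paragraph_locator):
    """Open each URL in turn and return a dict mapping URL -> article text.

    `paragraph_locator` is a (by, value) pair such as (By.XPATH, "//p").
    That value is an assumption; adjust it to the real article markup.
    """
    articles = {}
    # Skip missing hrefs; dict.fromkeys drops repeats and keeps first-seen order.
    for url in dict.fromkeys(u for u in urls if u):
        driver.get(url)  # navigate directly instead of clicking the link
        paragraphs = driver.find_elements(*paragraph_locator)
        articles[url] = "\n".join(p.text for p in paragraphs)
    return articles


# Hypothetical usage with the lists from the question:
# articles = scrape_articles(driver, Url, (By.XPATH, "//p"))
```

Because duplicate URLs are dropped before visiting, this also works around the doubled hrefs if both anchors point to the same article.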

CodePudding user response:

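You have to modify the XPath so that it uses the child step (/a) instead of the descendant step (//a). The descendant form matches every <a> nested at any depth inside the div, while the child form matches only <a> elements that are direct children of it. This assumes the duplicate link sits deeper in the markup than the first one.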

Instead of this -

news_articles = driver.find_elements(By.XPATH,"//div[contains(@class,'flex-wrap')]//a")

Use this -

news_articles = driver.find_elements(By.XPATH,"//div[contains(@class,'flex-wrap')]/a")
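To see the difference between the two expressions without a browser, here is a small sketch using Python's standard xml.etree.ElementTree. The markup is hypothetical (the page's real HTML is not shown in the post), and ElementTree supports only a subset of XPath, so an exact class match stands in for contains(@class, ...).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each article block has one direct <a> (thumbnail)
# and one <a> nested deeper (headline), both pointing to the same story.
SAMPLE = """
<section>
  <div class="flex-wrap">
    <a href="/story/1"><img src="thumb1.jpg" /></a>
    <div class="text">
      <h2><a href="/story/1">First headline</a></h2>
    </div>
  </div>
  <div class="flex-wrap">
    <a href="/story/2"><img src="thumb2.jpg" /></a>
    <div class="text">
      <h2><a href="/story/2">Second headline</a></h2>
    </div>
  </div>
</section>
"""

root = ET.fromstring(SAMPLE)

# //a : every <a> at any depth below the div, so each story appears twice
descendant_links = [a.get("href") for a in root.findall(".//div[@class='flex-wrap']//a")]

# /a : only <a> elements that are direct children of the div
child_links = [a.get("href") for a in root.findall(".//div[@class='flex-wrap']/a")]

print(descendant_links)  # ['/story/1', '/story/1', '/story/2', '/story/2']
print(child_links)       # ['/story/1', '/story/2']
```

If the real page nests both anchors at the same depth, /a alone will not separate them, and a more specific path or de-duplicating the collected hrefs would be needed.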
