How to scrape the nested data from Linkedin page using Selenium and Python

Time:08-13

I am working on a project for my master's degree in which I would like to scrape LinkedIn. I have run into a problem scraping users' education pages (e.g. https://www.linkedin.com/in/williamhgates/details/education/).

I would like to scrape all of a user's education entries. In this example I would like to scrape "Harvard University", which sits under the class mr1 hoverable-link-text t-bold, but I can't seem to get to it.

Here's the HTML code from LinkedIn:

<li  id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0">
                        <!----><div >
  <div>
        <a  target="_self" href="https://www.linkedin.com/company/1646/">
        <div >
    <div >
<!---->      <img width="48" src="https://media-exp1.licdn.com/dms/image/C4E0BAQF5t62bcL0e9g/company-logo_100_100/0/1519855919126?e=1668643200&amp;v=beta&amp;t=BL0HxGNOasVbI3u39HBSL3n7H-yYADkJsqS3vafg-Ak" loading="lazy" height="48" alt="Harvard University logo" id="ember59" >
</div>
  </div>
    </a>

  </div>

  <div >
    <div >
          <a  target="_self" href="https://www.linkedin.com/company/1646/">
        <div >
            <span >
              <span aria-hidden="true"><!---->Harvard University<!----></span><span ><!---->Harvard University<!----></span>
            </span>
<!----><!----><!---->        </div>
<!---->          <span >
            <span aria-hidden="true"><!---->1973 - 1975<!----></span><span ><!---->1973 - 1975<!----></span>
          </span>
<!---->      </a>


<!---->
      <div >
<!---->      </div>
    </div>

      <div >
<!---->    <ul >
        <li >
                <div >
<!----><!----><!----></div>

        </li>
    </ul>
<!----></div>
  </div>
</div>

                </li>

I have tried the following code:

education = driver.find_element("xpath", '//*[@id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0"]/div/div[2]/div[1]/a/div/span/span[1]/').text
print(education)

I keep getting the error:

Message: no such element: Unable to locate element:

Can anybody help? I would love to have a script that loops through the education entries and saves the place of education and the years attended.

CodePudding user response:

I would first get the list for the education section.

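from selenium.webdriver.common.by import By

# find_elements (plural) returns a list with one <li> per education entry
education_list = driver.find_elements(By.CSS_SELECTOR, 'ul.pvs-list > li')
# loop through education_list to read the place and the years
# relative locators are recommended for this task:
# find the image, then take the first and second spans that contain text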

I am adding further details to the code now. Please hold.
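
The loop described in the comments above could look roughly like the sketch below. It is not the answerer's finished code: the helper name extract_educations, the span[aria-hidden='true'] selector and the year regex are assumptions based on the HTML pasted in the question, where the first non-empty span holds the school and a later one holds the years. The locator strategy is passed as the plain string "css selector", in the same way the question passes "xpath".

```python
import re

# Assumption: a "years" line contains at least one four-digit year, e.g. "1973 - 1975".
YEAR_PATTERN = re.compile(r"\b\d{4}\b")


def extract_educations(driver):
    """Return a list of (place, years) tuples, one per education entry.

    `driver` is expected to be a logged-in Selenium WebDriver that has
    already loaded the education details page.
    """
    results = []
    # Each <li> directly under ul.pvs-list is assumed to be one education entry.
    for item in driver.find_elements("css selector", "ul.pvs-list > li"):
        spans = item.find_elements("css selector", "span[aria-hidden='true']")
        texts = [s.text.strip() for s in spans if s.text.strip()]
        if not texts:
            continue  # skip list items that carry no visible text
        place = texts[0]
        # Take the years from the first later span that looks like a date range.
        years = next((t for t in texts[1:] if YEAR_PATTERN.search(t)), None)
        results.append((place, years))
    return results
```

Calling extract_educations(driver) after the page has loaded gives a list that can be printed or written to a file; entries without a recognisable year range get None in the second position.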

CodePudding user response:

You can use the properties below to identify the school name list:

ancestorClass="optional-action-target-wrapper display-flex flex-column full-width"  tag="DIV"

Use these properties to identify the year list:

ancestorClass="optional-action-target-wrapper display-flex flex-column full-width"  tag="SPAN"

You may use the above info to compose an XPath to locate the list or, if you don't mind using other Python libraries, there is sample code on GitHub to scrape the school and year.
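
As a rough illustration of composing such an XPath, the sketch below builds one expression per property set. The function name build_xpath is hypothetical, and two things are assumptions taken from the HTML in the question rather than from this answer: that the DIV or SPAN is a direct child of the element carrying the ancestor class, and that the visible text sits in a nested span with aria-hidden='true'.

```python
ANCESTOR_CLASS = "optional-action-target-wrapper display-flex flex-column full-width"


def build_xpath(ancestor_class, tag):
    """Build an XPath for the text span inside <tag>, where <tag> is a direct
    child of an element whose class attribute equals `ancestor_class` exactly."""
    return (
        f"//*[@class='{ancestor_class}']"
        f"/{tag.lower()}"
        "//span[@aria-hidden='true']"
    )


# School names live under the DIV, years under the SPAN.
SCHOOL_XPATH = build_xpath(ANCESTOR_CLASS, "DIV")
YEAR_XPATH = build_xpath(ANCESTOR_CLASS, "SPAN")

# With a live driver the two lists could then be read like this:
# schools = [el.text for el in driver.find_elements("xpath", SCHOOL_XPATH)]
# years = [el.text for el in driver.find_elements("xpath", YEAR_XPATH)]
```

An exact match on the whole class string is brittle, so the expression will need adjusting if LinkedIn adds, removes or reorders a class.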

CodePudding user response:

@Nadia S. you can try the following code. It is written in Java, and I have added comments inline.

    @Test
    public void linkedInTest() {
        driver.get("https://www.linkedin.com");

        // You need to enter the credentials for your linkedin below for login
        driver.findElement(By.id("session_key")).sendKeys("");
        driver.findElement(By.id("session_password")).sendKeys("");
        driver.findElement(By.className("sign-in-form__submit-button")).click();
        driver.get("https://www.linkedin.com/in/williamhgates/details/education/");

        //Wait for the Education details to get populated. 
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(7));
        wait.until(ExpectedConditions.visibilityOfElementLocated(
                By.xpath("//div[@class = 'pvs-list__container']//div[@class = 'scaffold-finite-scroll__content']/ul")));
        
        //Take all elements showing education details in a list 
        List<WebElement> allEducation = driver.findElements(By
                .xpath("//div[@class = 'pvs-list__container']//div[@class = 'scaffold-finite-scroll__content']/ul/li"));
        //Extract details of each education item in the list. 
        //Below the details are directed to console. You can use a collection to store them.
        for (WebElement oneEducation : allEducation) {
            WebElement education = oneEducation.findElement(
                    By.xpath(".//*[contains(@class,\"mr1 hoverable-link-text\")]/span[@aria-hidden='true']"));
            System.out.print("Education - " + education.getText());
            try {
                WebElement educationType = oneEducation
                        .findElement(By.cssSelector(".t-14.t-normal span[aria-hidden='true']"));
                System.out.print("      Education Type - " + educationType.getText());
            } catch (NoSuchElementException e) {
                System.out.print("      Education Type - " + "is Not Specified");
            }
            try {
                WebElement educationYear = oneEducation
                        .findElement(By.cssSelector(".t-14.t-normal.t-black--light span[aria-hidden='true']"));
                System.out.println("        Education Year - " + educationYear.getText());
            } catch (NoSuchElementException e) {
                System.out.println("        Education Year - " + "is Not Specified");
            }
            }
        }

    }
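
Since the question asks for Python, here is a sketch of how the extraction part of the Java code above might be ported. The selectors are copied from the Java answer; the function names (first_text, read_education, wait_then_read) and the returned dict layout are assumptions of this sketch. Login is left out, and find_elements with an empty-list check stands in for the try/catch blocks, so a missing school name also falls back to the default instead of raising.

```python
# XPaths and CSS selectors copied from the Java answer above.
LIST_XPATH = ("//div[@class = 'pvs-list__container']"
              "//div[@class = 'scaffold-finite-scroll__content']/ul")
ITEM_XPATH = LIST_XPATH + "/li"
SCHOOL_XPATH = ".//*[contains(@class, 'mr1 hoverable-link-text')]/span[@aria-hidden='true']"
TYPE_CSS = ".t-14.t-normal span[aria-hidden='true']"
YEAR_CSS = ".t-14.t-normal.t-black--light span[aria-hidden='true']"
NOT_SPECIFIED = "is Not Specified"


def first_text(parent, by, selector, default=NOT_SPECIFIED):
    """Return the text of the first match, or `default` when nothing matches."""
    matches = parent.find_elements(by, selector)
    return matches[0].text if matches else default


def read_education(driver):
    """Collect one dict per education entry on the already-loaded page."""
    rows = []
    for item in driver.find_elements("xpath", ITEM_XPATH):
        rows.append({
            "education": first_text(item, "xpath", SCHOOL_XPATH),
            "type": first_text(item, "css selector", TYPE_CSS),
            "year": first_text(item, "css selector", YEAR_CSS),
        })
    return rows


def wait_then_read(driver, timeout=7):
    """Wait for the education list to become visible, then read it."""
    # Imported here so read_education can be tried without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(("xpath", LIST_XPATH)))
    return read_education(driver)
```

One thing to watch in both versions: the type selector .t-14.t-normal also matches elements that additionally carry t-black--light, so for an entry without a degree line the "type" field may come back with the years.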

CodePudding user response:

To extract the text Harvard University, you ideally need to induce WebDriverWait for visibility_of_element_located(); you can then use either of the following locator strategies:

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.pvs-list>li span.hoverable-link-text span"))).text)
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span"))).text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
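
Both locators above return only the first matching element. To collect every school name, as the question asks, one option is a custom wait condition, sketched below. WebDriverWait.until accepts any callable that takes the driver and keeps polling until it returns a truthy value. The names visible_school_names and wait_for_school_names are hypothetical, and narrowing the selector to span[aria-hidden='true'] is an assumption based on the question's HTML, which repeats each text in a second sibling span.

```python
# Assumption: span[aria-hidden='true'] keeps one copy of each name, because
# the pasted HTML repeats the text in a second, sibling span.
SCHOOL_CSS = "ul.pvs-list > li span.hoverable-link-text span[aria-hidden='true']"


def visible_school_names(driver):
    """Custom wait condition: return the school names once every matching
    element is displayed, otherwise False so the wait keeps polling."""
    elements = driver.find_elements("css selector", SCHOOL_CSS)
    if elements and all(el.is_displayed() for el in elements):
        return [el.text for el in elements]
    return False


def wait_for_school_names(driver, timeout=20):
    """Poll until visible_school_names returns a non-empty list."""
    # Imported here so the condition above can be tested without a browser.
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(visible_school_names)
```

With a logged-in driver on the education page, wait_for_school_names(driver) would return a list of names, or raise TimeoutException if nothing becomes visible within the timeout.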
