Web Scraping - Table Name

Time:03-21

New to webscraping.

I am trying to scrape a site. I recently learnt how to get information from tables, but I want to know how to get the table name. (I believe "table name" might be the wrong term here, but bear with me.)

Eg - https://www.msc.com/che/about-us/our-fleet?page=1

MSC is a shipping firm, and I need to get the list of their fleet and the information on each ship. I have written the following code, which retrieves the table data for each ship.

df <- MSCwp[i,1] %>% 
    read_html() %>% html_table()

MSCwp is the list of URLs. This code gets me all the information I need about the ships listed on the webpage except their names.

Is there any way to retrieve the name along with the table?

Eg - for the above-mentioned page, df will contain 10 tables (one for each ship on the page). df[[1]] will have the information about the ship Agamemnon, but I am not sure how to retrieve the ship name along with the table.

CodePudding user response:

You need to pull the names out from the main page.

library(rvest)
library(dplyr) 

url <- "https://www.msc.com/che/about-us/our-fleet?page=1"
page <- read_html(url)

names <- page %>% html_elements("dd a") %>% html_text()  
names

[1] "AGAMEMNON"       "AGIOS DIMITRIOS" "ALABAMA"         "ALLEGRO"         "AMALTHEA"        "AMERICA"         "ANASTASIA"      
[8] "ANTWERP TRADER"  "ARCHIMIDIS"      "ARIES" 

In this case the selector "dd a" matches every "a" node nested inside a "dd" node, and html_text() returns the text of each of those links, which here is the ship name.
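To get each name attached to its table, one option is to use the names as the names of the list that html_table() returns. The sketch below is only an illustration: it runs against a small made-up HTML snippet (built with rvest's minimal_html()) rather than the live MSC page, the table contents are invented, and it assumes the names and the tables come back in the same order and in equal numbers, which should be checked against the real page.

```r
library(rvest)

# Made-up stand-in for the fleet page: each ship name is a link inside a <dd>,
# followed by that ship's table. The real page layout may differ.
page <- minimal_html('
  <dl>
    <dd><a href="#">AGAMEMNON</a>
      <table><tr><th>Key</th><th>Value</th></tr>
             <tr><td>Capacity</td><td>1000</td></tr></table></dd>
    <dd><a href="#">ALABAMA</a>
      <table><tr><th>Key</th><th>Value</th></tr>
             <tr><td>Capacity</td><td>2000</td></tr></table></dd>
  </dl>
')

# Ship names: text of every <a> nested inside a <dd>
ship_names <- page %>% html_elements("dd a") %>% html_text2()

# One data frame per <table> on the page
ship_tables <- page %>% html_elements("table") %>% html_table()

# Matching by position only makes sense if the counts agree
stopifnot(length(ship_names) == length(ship_tables))

# Attach each name to its table
names(ship_tables) <- ship_names

# Look up a ship's table by name
ship_tables[["AGAMEMNON"]]
```

With the list named this way, dplyr::bind_rows(ship_tables, .id = "ship") can stack the tables into one data frame with the ship name as a column, provided the columns of the individual tables have compatible types.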
