New to webscraping.
I am trying to scrape a site. I recently learnt how to get information from tables, but I want to know how to get the table name. (I believe table name might be wrong word here but bear with me)
Eg - https://www.msc.com/che/about-us/our-fleet?page=1
MSC is shipping firm and I need to get the list of their fleet and information on each ship. I have written the following code that will retrieve the table data for each ship.
df <- MSCwp[i,1] %>%
read_html() %>% html_table()
MSCwp is the list url. This code gets me all the information I need about the ships listed in the webpage expect its name.
Is there any way to retrieve the name along with the table?
Eg - df for the above mentioned website will return 10 tables. (corresponding to the ships in the webpage). df[1] will have information about the ship Agamemnon but I am not sure how to retrieve the shipname along with the table.
CodePudding user response:
You need to pull the names out from the main page.
library(rvest)
library(dplyr)
url <- "https://www.msc.com/che/about-us/our-fleet?page=1"
page <- read_html(url)
names <- page %>% html_elements("dd a") %>% html_text()
names
[1] "AGAMEMNON" "AGIOS DIMITRIOS" "ALABAMA" "ALLEGRO" "AMALTHEA" "AMERICA" "ANASTASIA"
[8] "ANTWERP TRADER" "ARCHIMIDIS" "ARIES"
In this case I am looking for the text in the "a" child node of the "dd" nodes.