I am learning web scraping in R and understand the HTML, but one thing still confuses me.
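Code 1:
library(rvest)  # provides read_html(), html_table() and the %>% pipe

url <- "https://en.wikipedia.org/wiki/World_population"
ten_most_df <- read_html(url)
# html_table() on the whole page returns a list of every table; take the 6th
ten_most_populous <- ten_most_df %>%
  html_table() %>%
  .[[6]]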
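Code 2:
library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%

url <- "https://en.wikipedia.org/wiki/World_population"
ten_most_df <- read_html(url)
# Select one table by its absolute XPath, then parse it
ten_most_populous <- ten_most_df %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[4]/div/table[5]") %>%
  html_table()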
Are the methods used in Code 1 and Code 2 the same? In Code 1 we take the 6th element of the result, but Code 2 is not clear to me, because div[3] appears twice in the XPath. Could you please clarify this? It would be a great help. Thanks.
Answer:
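body/div[3]/div[3]/div[4]
means the 4th div
child of the 3rd div
child of the 3rd div
child of the body element.
Each step is evaluated relative to the element selected by the previous step, so the two div[3] steps refer to different elements at different nesting levels. Positions are 1-based and count only the div children of that parent.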
You really should be finding that out by reading a reference book on XPath, not by asking on StackOverflow.
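The nesting can be checked on a small example. The sketch below is only an illustration: the HTML string is a hypothetical miniature page, not Wikipedia's real markup, and the names html, page, by_xpath and by_index are made up for this example. It assumes the rvest package is installed.

```r
library(rvest)

# Hypothetical miniature page (not Wikipedia's real markup).
html <- "<html><body>
  <div>outer 1</div>
  <div>outer 2</div>
  <div>
    <div>inner 1</div>
    <div>inner 2</div>
    <div>
      <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
      <table><tr><th>y</th></tr><tr><td>2</td></tr></table>
    </div>
  </div>
</body></html>"
page <- read_html(html)

# Absolute path: the 2nd table child of the 3rd div inside the 3rd div of body.
# html_nodes() returns a node set, so html_table() gives a list of length 1.
by_xpath <- page %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/table[2]") %>%
  html_table()

# Document-wide index: the 2nd table found anywhere on the page.
by_index <- page %>%
  html_table() %>%
  .[[2]]
```

In this toy page both approaches reach the same table, but they are not the same method. html_table() on the whole page numbers every table in document order, while table[5] in an XPath step counts only the table children of one specific parent. That may be why the index is 6 in Code 1 and 5 in Code 2: the two would agree if exactly one table appears on the page before that div. Absolute paths like the one in Code 2 also tend to break whenever the page layout changes.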