I am reading the HTML of this web page into R and want to extract its data into a new data frame. Here is the website:
https://www.worldometers.info/world-population/population-by-country/
Visual inspection of the page source shows that the lines containing data values all start with '<td>'. Here is my code so far:
thepage<-readLines('https://www.worldometers.info/world-population/population-by-country/')
dataline <- grep('<td>', thepage)
dataline
This returns:
11
Which tells me all the data is in line 11. So I did this:
data <- thepage[11]
datalines <- grep('<td>', data)
datalines
This returns:
1
Which isn't helpful at all, since "data" is still one massive line. How do I split this massive line into multiple lines? My preferred data frame would look something like this:
TIA.
CodePudding user response:
How about the following?
library(tidyverse)
library(rvest)
url <- 'https://www.worldometers.info/world-population/population-by-country/'
# Parse the page, convert each <table> into a data frame, and keep the first one
pg <- xml2::read_html(url) %>%
  rvest::html_table() %>%
  .[[1]]
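If it helps to see what is going on without hitting the website, here is a small offline sketch. The HTML string, the country names, and the figures are all made up for illustration; only the general shape (a whole table on one line) is meant to resemble the real page. It shows why grep() gave you 1, how you could split the line by hand, and what html_table() does instead.

```r
library(rvest)

# Hypothetical stand-in for the real page: the whole table sits on one line.
# Country names and figures are invented for illustration.
one_line <- paste0(
  "<table><thead><tr><th>Country</th><th>Population</th></tr></thead>",
  "<tbody><tr><td>Aland</td><td>1,234,567</td></tr>",
  "<tr><td>Borduria</td><td>89,012</td></tr></tbody></table>"
)

# grep() returns the indices of the *elements* that match, not the number
# of matches inside an element, so a single long string can only give 1.
hits <- grep("<td>", one_line, fixed = TRUE)

# Literal answer to the question: break the string at every closing row tag,
# then keep only the pieces that contain data cells.
rows <- strsplit(one_line, "</tr>", fixed = TRUE)[[1]]
rows <- rows[grepl("<td>", rows, fixed = TRUE)]

# Parser-based route: let rvest turn the <table> into a data frame.
tbl <- html_table(minimal_html(one_line))[[1]]

# Values with thousands separators may come back as text (an assumption that
# can depend on the rvest version), so strip the commas and convert.
tbl$Population <- as.numeric(gsub(",", "", as.character(tbl$Population)))
```

Splitting by hand still leaves you with raw tags to clean up in every row, which is why I would lean on the parser. On the real page you would probably want the same comma-stripping step for each numeric column of pg.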