Separate a massive line into multiple lines to create a dataframe from an HTML page?

Time:10-03

I am reading the HTML of the web page below into R and want to process it to extract the useful data into a new dataframe. Here is the website:

https://www.worldometers.info/world-population/population-by-country/

Visual inspection of the page source shows that the lines containing data values all start with '<td>'. So here is my code so far:

thepage<-readLines('https://www.worldometers.info/world-population/population-by-country/')

dataline <- grep('<td>', thepage)
dataline

This returns:

11

This tells me that all the data is in line 11. So I did this:

data <- thepage[11]
datalines <- grep('<td>', data)
datalines

This returns:

1

This isn't helpful at all, as "data" is still one massive line. How do I split this massive line into multiple lines? My preferred dataframe would look something like this:

[two screenshots of the desired dataframe, not preserved]

TIA.

CodePudding user response:

How about the following?

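library(tidyverse)  # provides the %>% pipe
library(rvest)      # HTML parsing; depends on xml2

url <- 'https://www.worldometers.info/world-population/population-by-country/'

# html_table() returns one data frame per <table> element on the page;
# .[[1]] keeps the first one (assumed here to be the population table)
pg <- xml2::read_html(url) %>%
  rvest::html_table() %>%
  .[[1]]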
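
If you would rather stay close to the readLines() approach from the question, the single long line can also be split by hand in base R. Note that grep() returns the indices of the matching elements, so calling it on a length-one vector can only ever return 1, which is why the second attempt in the question did not help. The following is only a rough sketch under stated assumptions: the `line` string is a made-up stand-in for the real page line, and the column names are hypothetical. A proper HTML parser such as rvest is usually the more robust choice.

```r
# Made-up stand-in for the one massive line (NOT the real page content)
line <- paste0(
  "<tr><td>1</td><td>Aland</td><td>1,000</td></tr>",
  "<tr><td>2</td><td>Borduria</td><td>2,500</td></tr>"
)

# Step 1: break the single string into one piece per table row
rows <- strsplit(line, "</tr>", fixed = TRUE)[[1]]

# Step 2: within each row, pull out every <td>...</td> cell and strip the tags
cells <- lapply(rows, function(r) {
  m <- regmatches(r, gregexpr("<td[^>]*>.*?</td>", r, perl = TRUE))[[1]]
  gsub("<[^>]+>", "", m)
})

# Step 3: stack the rows into a dataframe (column names are hypothetical)
df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(df) <- c("rank", "country", "population")

# Step 4: remove thousands separators so the numbers become numeric
df$rank <- as.integer(df$rank)
df$population <- as.numeric(gsub(",", "", df$population))

print(df)
```

On the real page you would replace `line` with `thepage[11]` and adjust the column names to match the table, assuming the rows there are also delimited by `</tr>` and the cells by `<td>...</td>`.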