I extracted some text from a web page,
but it contains some whitespace and special characters that I cannot remove easily.
I tried this:
library(dplyr)
library(rvest)
url <- "http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1607-40412016000100014&lang=es"
page <- read_html(url)
referenes_whitout_end_spaces <- page %>%
  html_elements("p") %>%
  .[grepl("(Links)|(doi:)", as.character(.))] %>%
  html_text() %>%
  gsub("[\n\t\b]", "", .) %>%
  gsub("\\[.*Links.*\\]", "", .) %>%
  gsub("\\s|\\n", " ", .) %>%
  trimws("both", whitespace = "[ \t\r\n\b]")
referenes_whitout_end_spaces
But the whitespace at the end of the references remains.
How can I remove it?
CodePudding user response:
The issue is that the HTML page contains a lot of &nbsp; HTML entities, which stand for non-breaking spaces. These entities are converted to literal non-breaking space characters, \xA0. Thus, you can simply add that character to the whitespace argument of trimws:
trimws("both", whitespace = "[ \xA0\t\r\n\b]")
Or, if you want to support all Unicode whitespace:
trimws("both", whitespace = "\\p{Z} ")
CodePudding user response:
Those are some funky Unicode whitespace characters. To get rid of them, copy them and paste them into the whitespace argument of trimws:
trimws(referenes_whitout_end_spaces, whitespace="[ ]")
The stuff inside the [] is pasted from the bad output.
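If copy-pasting invisible characters is fiddly, a sketch of the same idea with an explicit escape, assuming the stray characters are the non-breaking spaces (U+00A0) mentioned in the first answer:
trimws(referenes_whitout_end_spaces, whitespace = "[ \u00A0\t\r\n]")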
CodePudding user response:
We could use str_squish:
str_squish removes whitespace from the start and end of a string, and also reduces repeated whitespace inside a string:
library(stringr)
referenes_whitout_end_spaces %>%
  str_squish()
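As far as I can tell, stringr's ICU regexes treat the non-breaking space as whitespace, so it should be squished away as well. A quick check on a made-up string (the \u00A0 escapes stand in for the stray characters):
str_squish("Reference\u00A0 text. \u00A0")
# expected: "Reference text."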