Especially hard-to-remove trailing whitespace in text with R


I extracted some text from a web page.

But it contains some whitespace or special characters that I cannot remove easily.

I tried this:

library(dplyr)
library(rvest)

url <- "http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1607-40412016000100014&lang=es"

page <- read_html(url)
referenes_whitout_end_spaces <- page %>%
  html_elements("p") %>%
  .[grepl("(Links)|(doi:)", as.character(.))] %>%
  html_text() %>%
  gsub("[\n\t\b]", "", .) %>%
  gsub("\\[.*Links.*\\]", "", .) %>%
  gsub("\\s|\\n", " ", .) %>%
  trimws("both", whitespace = "[ \t\r\n\b]")

referenes_whitout_end_spaces

but the whitespace at the end of the references remains.

How can I remove this whitespace?

CodePudding user response:

The issue is that the HTML page contains a lot of &nbsp; HTML entities standing for non-breaking spaces. These entities are converted to literal non-breaking spaces, \xA0.
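You can verify this by inspecting the code points at the end of one of the extracted strings (a minimal check in base R; 160 is the decimal code point of U+00A0, the non-breaking space):

x <- referenes_whitout_end_spaces[1]
tail(utf8ToInt(x), 5)  # values of 160 are non-breaking spaces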

Thus, you can simply add that character to the whitespace class in the final trimws step of your pipeline:

trimws("both", whitespace = "[ \xA0\t\r\n\b]")

Or, if you want to support all Unicode whitespace:

trimws("both", whitespace = "\\p{Z} ")

CodePudding user response:

Those are some funky Unicode whitespace characters. To get rid of them, copy them from the bad output and paste them into the whitespace argument of trimws.

trimws(referenes_whitout_end_spaces, whitespace="[           ]")

The stuff inside the [] is pasted from the bad output.
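If you would rather not rely on invisible pasted characters, you can write them as escapes instead (a sketch assuming the offending characters are non-breaking spaces, U+00A0):

trimws(referenes_whitout_end_spaces, whitespace = "[\u00A0 ]")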

CodePudding user response:

We could use str_squish:

str_squish() removes whitespace from the start and end of a string, and also collapses repeated whitespace inside it:

library(stringr)

referenes_whitout_end_spaces %>%
  str_squish()
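For example (a small illustration; stringr's regex engine, ICU, treats \s as including the Unicode separator class, so non-breaking spaces are covered):

str_squish("  A. Author\u00A0 (2016).  Title.\u00A0 ")
# [1] "A. Author (2016). Title."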