R readLines encoding issues

Time:04-20

I am trying to read the text of an ebook from this URL:

url <- "http://www.gutenberg.org/cache/epub/55/pg55.txt"
raw <- readLines(url, encoding = "UTF-8")

However, when I inspect the raw object, it returns results like this:

> raw[1]
[1] "\037‹\b\bÿ†u_\002ÿpg55.txt.utf8.gzip"
> raw[2]
[1] "¾ûçþd‡ú\u008f]»ßð×ÿÝ\177ýïç¶=Æ\bøëÅ\u008fvbñ™\177úGÿÌwg{\177×äǾ³cÄEù§XÏo\017íÐØW\177ßM×Ø/ûu,íOm3¬\037óÞþÜ­Ÿlí~î¦õ#?÷ŸfëÁ\037Å2þÜ\035wöÙ\037ûãS{ÕKcõþÐ\u008dëÞn®¿QkØ\016C·Úë\031±€?6»n½ø0L9\001.Ô÷çÃê¼ã'ÿ«o.~ùM³ß÷½mE3.~hp\006Z.Ù?aE?@\022ørÿÔŸ'\u008d\037‹ùaš\032NËä—Ör÷8Ùà\027Ÿ‡V«õϱž\177hº£­Óï\037»cÃc7"
> raw[3]
[1] "W~ –\023\033½ø¦]÷‡väÃþ"

How should I read the text properly in this case?

CodePudding user response:

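The first line of your output starts with the gzip magic bytes (\037 ‹ \b) and names pg55.txt.utf8.gzip, so the response is gzip-compressed and readLines() is reading the compressed bytes as text. This is a compression problem rather than an encoding one. If you are open to using {readr}, you can use the read_file() function to read the data: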

library(readr)
url <- "http://www.gutenberg.org/cache/epub/55/pg55.txt"
raw <- read_file(url)

This gives a character vector of length 1, holding the whole file as a single string. You can inspect the raw object:

substr(raw, 1, 50)
[1] "The Project Gutenberg EBook of The Wonderful Wizar"
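
If you would rather stay in base R, here is a minimal sketch. The leading bytes in raw[1] (0x1f 0x8b 0x08) are the gzip magic number, so wrapping the connection in gzcon() should decompress the data while reading. The helper names is_gzip and read_gz_lines are made up for illustration, and the example runs on a small local gzip file standing in for the download. Whether the server still sends this URL compressed is an assumption.

```r
# Hypothetical helper: TRUE if a raw vector starts with the gzip magic number.
is_gzip <- function(bytes) {
  length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0x1f, 0x8b)))
}

# Hypothetical helper: wrap any connection in gzcon() and read it line by line.
# Closing the gzcon wrapper also closes the underlying connection.
read_gz_lines <- function(con) {
  gz <- gzcon(con)
  on.exit(close(gz))
  readLines(gz, encoding = "UTF-8")
}

# Local stand-in for the download, so the sketch runs without network access.
path <- tempfile(fileext = ".txt")
out <- gzfile(path, open = "wb")
writeLines(c("The Project Gutenberg EBook", "of The Wonderful Wizard of Oz"), out)
close(out)

# Diagnose: look at the first two bytes of the file as stored on disk.
header <- readBin(path, what = "raw", n = 2)
print(is_gzip(header))

# Read: decompress on the fly through gzcon().
lines <- read_gz_lines(file(path, open = "rb"))
print(lines)

unlink(path)
```

For the URL in the question, the equivalent call would be read_gz_lines(url("http://www.gutenberg.org/cache/epub/55/pg55.txt")). By default gzcon() passes uncompressed input through unchanged (allowNonCompressed = TRUE), so the same call should also work if the server sends plain text.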