How to read text file with page break character in R

Time:11-27

I am quite new to R. I have a few text (.txt) files in a folder that were converted from PDF and contain the page break (form feed) character (ASCII 12, \f). I need to read these files into a data frame in which one row represents one PDF page. In other words, a new row should start only where there is a page break (\f).

The problem is that when a text file is loaded into R, every new line becomes a new row, which I do not want. Please help me with this. Thanks!

Some methods that I have tried are read.table and readLines.

As you can see in lines 273 and 293, there is a \f, so I need whatever comes after each \f to go into its own row (which represents a page).

CodePudding user response:

Base R:

vec <- c("a","b","\fd","e","\ff","g")
# vec <- readLines("file.txt")
out <- data.frame(page = sapply(split(vec, cumsum(grepl("^\f", vec))), paste, collapse = "\n"))
out
#     page
# 0   a\nb
# 1 \fd\ne
# 2 \ff\ng

If you need the leading \f removed, that is easily done with:

out$page <- sub("^\f", "", out$page)
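Since the question mentions several files in one folder, the same split-on-\f idea can be wrapped in a function and applied to each file. This is only a sketch: `read_pages` is a hypothetical helper name, the `file`, `page` and `text` columns are my own choice, and the demo writes two small temporary files in place of the real folder.

```r
# Hypothetical helper: read one text file and return one row per page.
read_pages <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # A line that starts with a form feed (\f) opens a new page.
  page_id <- cumsum(grepl("^\f", lines))
  # Glue the lines of each page back together with newlines.
  pages <- vapply(split(lines, page_id), paste, character(1), collapse = "\n")
  data.frame(
    file = rep(basename(path), length(pages)),
    page = seq_along(pages),
    text = sub("^\f", "", unname(pages)),  # drop the leading form feed
    stringsAsFactors = FALSE
  )
}

# Demo with temporary files standing in for the real folder.
dir <- tempfile("pages_demo")
dir.create(dir)
writeLines(c("a", "b", "\fd", "e"), file.path(dir, "one.txt"))
writeLines(c("x", "\fy"), file.path(dir, "two.txt"))

files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
all_pages <- do.call(rbind, lapply(files, read_pages))
```

Here `all_pages` has four rows, two pages for each demo file. If the PDF converter also writes a form feed after the last page, the final row of each file will be empty and can be filtered out afterwards.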

CodePudding user response:

Does something like this work?

library(tidyverse)
read_file("mytxt.txt") %>%
  str_split("\f") %>%  # split on the form feed (page break) character
  unlist() %>%
  as_tibble_col("data")

It just reads the whole file as a single string and then splits it afterwards. You may have to replace the splitting character with something else.
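The same read-then-split approach can also be sketched in base R, in case the tidyverse is not available. The function name `read_pages_whole` is hypothetical, and the sketch assumes the page separator is a single form feed character (\f).

```r
# Hypothetical helper: read the whole file as one string, then split on \f.
read_pages_whole <- function(path) {
  txt <- readChar(path, nchars = file.size(path), useBytes = TRUE)
  # fixed = TRUE treats "\f" as a literal character, not a regular expression.
  pages <- strsplit(txt, "\f", fixed = TRUE)[[1]]
  # Drop the line break left at the end of each page.
  pages <- sub("\r?\n$", "", pages)
  data.frame(page = seq_along(pages), text = pages, stringsAsFactors = FALSE)
}

# Demo: a temporary three-page file, written byte for byte.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\nb\n\fd\ne\n\ff\ng\n"), tmp)
pages_df <- read_pages_whole(tmp)
```

Unlike the line-based version, this never looks at individual lines, so a \f in the middle of a line still starts a new page. Note that `strsplit` drops an empty piece after a trailing separator but keeps one before a leading separator, so a file that begins with \f yields an empty first row.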
