I am quite new to R. I have a few text (.txt) files in a folder that were converted from PDF, with a form feed character (ASCII 12, \f) marking each page break. I need to read these text files into R and produce a data frame in which one row represents one PDF page. In other words, a new row should start only where there is a page break (\f).
The problem is that when a text file is loaded into R, every line becomes a new row, which I do not want. Please assist me with this. Thanks!
The methods I have tried so far are read.table and readLines.
CodePudding user response:
Base R:
vec <- c("a","b","\fd","e","\ff","g")
# vec <- readLines("file.txt")
out <- data.frame(page = sapply(split(vec, cumsum(grepl("^\f", vec))), paste, collapse = "\n"))
out
# page
# 0 a\nb
# 1 \fd\ne
# 2 \ff\ng
If you need the leading \f removed, that is easily done with:
out$page <- sub("^\f", "", out$page)
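Since the question mentions several files in a folder, the same idea could be wrapped in a function and applied to each file. This is only a sketch under assumptions: `read_pages` is a hypothetical helper name, and the temporary folder with its two demo files stands in for your real folder. It also assumes each form feed sits at the start of a line, as in the example above.

```r
# Hypothetical helper: returns one row per page for a single text file.
read_pages <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # A new page starts at every line that begins with a form feed.
  page_id <- cumsum(grepl("^\f", lines))
  pages <- vapply(split(lines, page_id), paste, character(1), collapse = "\n")
  data.frame(
    file = rep(basename(path), length(pages)),
    page = seq_along(pages),
    text = sub("^\f", "", unname(pages)),  # drop the leading form feed
    stringsAsFactors = FALSE
  )
}

# Demo data in a temporary folder (replace `dir` with your own folder).
dir <- tempfile("pages_")
dir.create(dir)
writeLines(c("a", "b", "\fd", "e", "\ff", "g"), file.path(dir, "one.txt"))
writeLines(c("x", "\fy"), file.path(dir, "two.txt"))

# Read every .txt file and stack the per-file data frames.
files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
out <- do.call(rbind, lapply(files, read_pages))
out
```

If your converter also writes a form feed after the last page, the final row of each file may be empty and can be dropped with something like `out[nzchar(out$text), ]`.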
CodePudding user response:
Does something like this work?
library(tidyverse)
read_file("mytxt.txt") %>%
str_split("\f") %>%
unlist() %>%
as_tibble_col("data")
It just reads the file as raw text and then splits it afterwards. You may have to replace the splitting character with something else, depending on how the page breaks are encoded in your files.
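If you would rather not load the tidyverse, the same read-everything-then-split idea can be sketched in base R. The temporary file below is demo data standing in for your own file, and the cleanup of line endings is my own assumption about what you want in each row.

```r
# Demo file with two page breaks (replace `path` with your own file).
path <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "\fd", "e", "\ff", "g"), path)

# Read the whole file as one string; file.size() gives an upper bound
# on the number of characters to read.
raw_text <- readChar(path, nchars = file.size(path))

# Normalise Windows line endings, then split on the form feed.
raw_text <- gsub("\r\n", "\n", raw_text, fixed = TRUE)
pages <- strsplit(raw_text, "\f", fixed = TRUE)[[1]]

# Drop the newline left at the end of each page.
pages <- sub("\n$", "", pages)

out <- data.frame(page = pages, stringsAsFactors = FALSE)
out
```

Note that `strsplit` does not return a trailing empty piece when the string ends with the separator, so a form feed at the very end of the file does not create an extra empty row here.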