I have a long word document that lists items like this:
- Item 1
- entry1
- entry2
- entry3
- Item 2
- entry1
- entry2
- entry3
- (etc...)
The items are species names, and the entries are corresponding location and date information, but that doesn't matter much now.
I am trying to get this extremely long document into a reasonable table/tibble object in R, for which my idea was to use:
library (stringr)
data <- readLines("data.txt")
test_data <- str_sub(data, 1, 3)
and then assign another vector with the "Item" identity of each element of "data" (i.e. what species each date location corresponds with). I was trying to use a for loop for this and test whether each line starts with " " or not, but I am stuck.
results <- vector (length = length(data))
for (i in 1:length(data)) {
if (test_data[i] != " ") {
results[i] = data[i]
} else {
while #here I am stuck
Thank you
CodePudding user response:
I think I've got something to start with. The idea is to load your text file as a single long string and then break it into pieces corresponding to Item entries and store it in a list. Lastly, use lapply
on the list to separate Item and entries.
filename <- "test.txt"
# read your file a single long string
step1 <- readChar(filename, file.info(filename)$size)
# find the pattern that separate each Item (from a copy/paste of the example it is "\r\n\r\n") and make a list of items
# with associated entries
step2 <- as.list(unlist(strsplit(step1, split = "\r\n\r\n")))
# lastly split the vectors from step2
step3 <- lapply(step2, function(x) unlist(strsplit(x, split = "\r\n ")))
Output:
> step3
[[1]]
[1] "Item 1" "entry1" "entry2" "entry3"
[[2]]
[1] "Item 2" "entry1" "entry2" "entry3"
From here you can start using "usual" tools to clean your data and organize it e.g.
df <- as.data.frame(do.call(rbind, step3))
df <- tidyr::pivot_longer(df, 2:ncol(df))
df <- df[, -2]
names(df) <- c("Items", "Entries")
df
# A tibble: 6 x 2
Items Entries
<chr> <chr>
1 Item 1 entry1
2 Item 1 entry2
3 Item 1 entry3
4 Item 2 entry1
5 Item 2 entry2
6 Item 2 entry3
CodePudding user response:
Here's a tidyverse approach based on the fact that "each entry starts with twelve spaces".
# fake data
obj <- c("Item 1",
" entry1",
" entry2",
" entry3",
" entry4",
"Item 2",
" entry1",
" entry2",
" entry3"
)
writeLines(obj, con = "data2.txt")
# read in and convert
library(tidyverse)
dat <- readLines("data.txt", skipNul = TRUE)
dat |>
enframe() |>
separate(
value,
into = c("item", "entry"),
sep = "\\s{12}",
convert = TRUE,
fill = "right"
) |>
mutate(item = na_if(item, "")) |>
fill(item, .direction = "down") |>
filter(!(is.na(entry)))