Extract information from a structured text file to create a dataframe in R

I need to organize the information from a long (and old) text file containing thousands of items into a dataframe. The information follows the same structure for every item, and my goal is to place each item in a separate row of the dataframe.

Structure of the text file:

Title (number of books) Country
Date time (author) Page number CODES letter,letter...
Notes

An example of the content, showing the first 3 items:

Pride and Prejudice (5) United Kingdom
1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
Deposited by the G.M.W.

Brave New World (2) United Kingdom
1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
Deposited by the E.L.

Wide Sargasso Sea  (1) Jamaica
1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
Sent to the N.U.C.

I need to extract the first 6 elements of each item (title, number, country, date, time, author) and ignore the rest. The desired dataframe would be:

Title                NoBooks  Country         Date  time   Author
Pride and Prejudice  5        United Kingdom  1981  10:23  Jane Austen
Brave New World      2        United Kingdom  1977  09:14  Aldous Huxley
Wide Sargasso Sea    1        Jamaica         1989  16:51  Jean Rhys

I have found two similar posts (converting multiple lines of text into a data frame and Converting text file into dataframe in R), but my file doesn't have key characters that can be used as separators.

Is there a way to separate my elements? I've found a solution using Python libraries, but I would like to do it with R. Any suggestions?

CodePudding user response:

Hope this helps.

P.S. Some columns could afterwards be cast to numeric or date types, since everything here is read as text.

data<-"Pride and Prejudice (5) United Kingdom
1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
Deposited by the G.M.W.

Brave New World (2) United Kingdom
1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
Deposited by the E.L.

Wide Sargasso Sea  (1) Jamaica
1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
Sent to the N.U.C."

con <- textConnection(data, "r") # replace with:  con <- file("yourfile.txt") 
data <- readLines(con)
close(con)

# each item takes 4 lines: 3 lines of content plus the blank separator
l1 <- data[seq(1, length(data), 4)]   # first line of each item
l2 <- data[seq(2, length(data), 4)]   # second line of each item

# capture Title, NoBooks, Country from the first line
d1 <- regmatches(l1, regexec("^(.*) \\((\\d+)\\) (.*)", l1))
# capture Date, time, Author from the second line
d2 <- regmatches(l2, regexec("^(\\d{4}) (\\d{2}:\\d{2}) h \\((.*)\\)", l2))
# bind the capture groups together and drop the full-match columns (positions 1 and 5)
df <- as.data.frame(do.call(rbind, mapply(c, d1, d2, SIMPLIFY = FALSE))[, c(-1, -5)])

colnames(df) <- c("Title","NoBooks","Country","Date","time","Author")

df
#>                 Title NoBooks        Country Date  time        Author
#> 1 Pride and Prejudice       5 United Kingdom 1981 10:23   Jane Austen
#> 2     Brave New World       2 United Kingdom 1977 09:14 Aldous Huxley
#> 3  Wide Sargasso Sea        1        Jamaica 1989 16:51     Jean Rhys
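
Following up on the P.S. above, a minimal sketch of how the resulting text columns could be cast to more convenient types (assuming the column names produced by the code above):

# NoBooks and Date are plain character vectors at this point
df$NoBooks <- as.integer(df$NoBooks)
df$Date    <- as.integer(df$Date)  # or as.Date(paste0(df$Date, "-01-01")) if a Date class is preferred
str(df)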

CodePudding user response:

There are a couple of key steps here. First we need to read the text into R in a somewhat usable format; here I use read.delim just to get everything into one column, and we keep the blank lines to use as delimiters later. Next we define grouping variables for the section and for the row number within each section to help us pivot: we split the items on the blank lines, group by section, and use row_number to number the rows. Finally we drop the helper columns and pivot.

library(tidyverse)

f <- "<path_to_file>"   # replace with the path to your text file

f2 <- f %>%
  read.delim(., header = FALSE, blank.lines.skip = FALSE) %>%
  # blank lines mark the boundary between items
  mutate(Sections = ifelse(V1 == "", TRUE, FALSE),
         sectionID = cumsum(Sections)) %>%
  filter(Sections == FALSE) %>%
  group_by(sectionID) %>%
  mutate(RowID = row_number()) %>%
  select(-Sections) %>%
  pivot_wider(id_cols = sectionID, names_from = RowID,
              values_from = V1, names_prefix = "Row")

From here we have a more usable data set, and you can use string functions to split each piece of information out into its own column. My examples below use regular expressions.

f3 <- f2 %>% mutate(Title = trimws(gsub("^(.*?)\\(.*$", "\\1", Row1)),
                    NumBooks = gsub(".*?\\((\\d+)\\).*", "\\1", Row1),
                    Country = trimws(gsub(".*\\)(.*$)", "\\1", Row1)))

# A tibble: 3 × 7
# Groups:   sectionID [3]
  sectionID Row1                                   Row2              Row3  Title NumBo…¹ Country
      <int> <chr>                                  <chr>             <chr> <chr> <chr>   <chr>  
1         0 Pride and Prejudice (5) United Kingdom 1981 10:23 h (Ja… Depo… Prid… 5       United…
2         1 Brave New World (2) United Kingdom     1977 09:14 h (Al… Depo… Brav… 2       United…
3         2 Wide Sargasso Sea  (1) Jamaica         1989 16:51 h (Je… Sent… Wide… 1       Jamaica

Bear in mind that in this short sample of data everything works nicely, but if you run into text structures like a title with a parenthetical in it, it may require further refinement of the patterns that you match.
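
For example, if a title could contain its own parentheses, one option (just a sketch, assuming the book count is always the last parenthesised number on the line and the country contains no parentheses) is to anchor the patterns on that last "(digits)" group:

# anchor on the last "(digits)" group so parentheses inside the title don't break the match
f3 <- f2 %>% mutate(Title    = trimws(gsub("^(.*)\\(\\d+\\)[^()]*$", "\\1", Row1)),
                    NumBooks = gsub("^.*\\((\\d+)\\)[^()]*$", "\\1", Row1),
                    Country  = trimws(gsub("^.*\\(\\d+\\)([^()]*)$", "\\1", Row1)))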

Another way to split is using separate():

f2 %>% separate(Row1, into = c("Title", "NumBooks", "Country"), sep = '[()]')
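
Splitting on the parentheses this way leaves stray whitespace around Title and Country, so a quick trim afterwards helps (a small sketch):

f2 %>%
  separate(Row1, into = c("Title", "NumBooks", "Country"), sep = '[()]') %>%
  mutate(across(c(Title, Country), trimws))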

f2 %>%
  separate(Row2,
           into = c("Year", "Time", "Author", "Page", "Codes"),
           sep = '(?<=\\d{4})\\s(?=\\d{2})|(?<=\\s)h(?=\\s)|Page\\s?|\\bCODES\\s?') %>%
  mutate_at(.vars = c("Time", "Author", "Page", "Codes"),
            ~ trimws(gsub('[()]', "", .x)))
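
Putting the two splits together and keeping only the requested columns gives something close to the desired dataframe (a sketch; the final column names are my own choice):

f2 %>%
  separate(Row1, into = c("Title", "NoBooks", "Country"), sep = '[()]') %>%
  separate(Row2,
           into = c("Date", "Time", "Author", "Page", "Codes"),
           sep = '(?<=\\d{4})\\s(?=\\d{2})|(?<=\\s)h(?=\\s)|Page\\s?|\\bCODES\\s?') %>%
  ungroup() %>%
  mutate(across(c(Title, Country, Time, Author), trimws),
         Author = gsub('[()]', "", Author)) %>%   # strip the parentheses around the author name
  select(Title, NoBooks, Country, Date, Time, Author)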