Home > OS >  Extract information from a loosely structured text file to create a dataframe in R
Extract information from a loosely structured text file to create a dataframe in R

Time:11-15

I need to organize the information from a long text file containing hundreds of items into a dataframe. The information in the text file is partially structured. My goal is to arrange each item in a different row of the dataframe.

Items are presented in two types of structure and both types of items sometimes have different number of lines.

Structure of the 1st item type:

Title (number of books) Country
Date time (author) Page number CODES letter,letter...
Notes (with different number of lines)
Ends with a few empty lines, and at times some of the newlines contain a number of spaces.

Structure of the 2nd item type, always beginning with 'Shelf Review':

Shelf Review () Book position in the library
Date time (name of worker) Pg. number CODES letter,letter...
Notes (with different number of lines)
Price (or sometimes written as Value)
More notes (with different number of lines)
Ends with a few empty lines, and at times some of the newlines contain a number of spaces.

An example of the content, showing 6 items:

Pride and Prejudice (5) United Kingdom
1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
Deposited by the G.M.W.

                      

Brave New World (2) United Kingdom
1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
Deposited by
the E.L.
                                    

Wide Sargasso Sea  (1) Jamaica
1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
Sent
to the
N.U.C.
    
                  

Shelf Review  () Wil. 38b (002)
1992 13:55 h (Jeniffer Slack) Page 183 CODES OB,CA
Some text that 
is not useful
Price           12
more text that is not 
useful
                                  

Shelf Review  () Pet. 12.8t (030)
2003 12:37 h (Andrew Lee) Page 203 CODES OB,BU
Some text that 
is not useful
Value           25
more text that is not 
useful

                       

I need to extract some elements of each item and ignore the rest. The desired dataframe would be:

Title Number Location Date time Name
Pride and Prejudice 5 United Kingdom 1981 10:23 Jane Austen
Brave New World 2 United Kingdom 1977 09:14 Jaldous Huxley
Wide Sargasso Sea 1 Jamaica 1989 16:51 Jean Rhys
Shelf Review 12 Wil. 38b (002) 1992 13:55 Jeniffer Slack
Shelf Review 25 Pet. 12.8t (030) 2003 16:51 Andrew Lee

I found similar posts (one, two, three, four, five) but my database doesn't have either key characters to be used as separators or a neatly structured data.

Any suggestions would be much appreciated.

CodePudding user response:

These kinds of things take some time and patience to parse out. Here is one approach. I would read the data in line by line using readLines, then I would map out the chucks of text needed for each entry using line indices. After that, it is all about pulling out the data you want from the text blocks with some regex and checking along the way:

library(tidyverse)

#data saved in a text file
text <- readLines("text_example.txt")

tibble(start = c(0, which(text == "")),
       end = lead(start, default = length(text) 1),
       txt = map2(start, end, ~text[{.x   1}:{.y-1}])) |>
  filter(map_lgl(txt, ~ length(.x) >1)) |>
  mutate(Title = map_chr(txt, ~str_extract(.x[1], "^.*?(?=\\()")),
         Number = map2_chr(txt, Title, ~ifelse(.y == "Shelf Review  ",
                                               str_extract(.x[grepl("Price|Value", .x)],
                                                           "\\d "),
                                               str_extract(.x[1], "(?<=\\()\\d ?(?=\\))"))),
         Location = map_chr(txt, ~str_extract(.x[1], "(?<=\\)).*?$")),
         Date = map_chr(txt, ~str_extract(.x[2], "\\d{4}")),
         time = map_chr(txt, ~str_extract(.x[2], "\\d{2}:\\d{2}")),
         Name = map_chr(txt, ~str_extract(.x[2], "(?<=\\().*?(?=\\))"))) |>
  select(-c(start, end, txt))
#> # A tibble: 5 x 6
#>   Title                  Number Location            Date  time  Name          
#>   <chr>                  <chr>  <chr>               <chr> <chr> <chr>         
#> 1 "Pride and Prejudice " 5      " United Kingdom"   1981  10:23 Jane Austen   
#> 2 "Brave New World "     2      " United Kingdom"   1977  09:14 Aldous Huxley 
#> 3 "Wide Sargasso Sea  "  1      " Jamaica"          1989  16:51 Jean Rhys     
#> 4 "Shelf Review  "       12     " Wil. 38b (002)"   1992  13:55 Jeniffer Slack
#> 5 "Shelf Review  "       25     " Pet. 12.8t (030)" 2003  12:37 Andrew Lee
  • Related