Home > Back-end >  Convert text data to tabular format in r
Convert text data to tabular format in r

Time:12-22

I have a data which is like this:

 data <-
"--------------------------------------------------------\nJrId:
1\nJournalTitle: AADE editors' journal\nMedAbbr: AADE Ed J\nISSN
(Print): 0160-6999\nISSN (Online): \nIsoAbbr: AADE Ed J\nNlmId:
7708172\n--------------------------------------------------------\nJrId:
2\nJournalTitle: AANA journal\nMedAbbr: AANA J\nISSN (Print):
0094-6354\nISSN (Online): 2162-5239\nIsoAbbr: AANA J\nNlmId:
0431420\n--------------------------------------------------------\nJrId:
3\nJournalTitle: AARN news letter\nMedAbbr: AARN News Lett\nISSN
(Print): 0001-0197\nISSN (Online): \nIsoAbbr: AARN News Lett\nNlmId:
1251052\n--------------------------------------------------------"

Is there a way to change it to something like this?

# A tibble: 3 × 7
   JrID JournalTitle          MedAbbr        `ISSN (Print)` `ISSN (Online)` IsoAbbr          NlmId
  <dbl> <chr>                 <chr>          <chr>          <chr>           <chr>            <dbl>
1     1 AADE editors' journal AADE Ed J      0160-6999                      AADE Ed J      7708172
2     2 AANA journal          AANA J         0094-6354      2162-5239       AANA J          431420
3     3 AARN news letter      AARN News Lett 0001-0197                      AARN News Lett 1251052

I made this example manually but I want to do it automatically in r

CodePudding user response:

Using a bit of tidyr and dplyr you could do:

library(tidyr)
library(dplyr)

raw_text <- readLines(fil)

data.frame(raw = raw_text) |>
  # Get rid of empty rows
  filter(raw != "") |>
  # Add an id
  mutate(id = cumsum(grepl("^- $", raw))) |>
  # Get rid of separator lines
  filter(!grepl("^- $", raw)) |>
  separate(raw, into = c("category", "value"), sep = ":\\s?") |>
  pivot_wider(names_from = category, values_from = value) |> 
  select(-id)
#> # A tibble: 3 × 7
#>   JrId  JournalTitle          MedAbbr        ISSN (Print…¹ ISSN …² IsoAbbr NlmId
#>   <chr> <chr>                 <chr>          <chr>         <chr>   <chr>   <chr>
#> 1 1     AADE editors' journal AADE Ed J      0160-6999     ""      AADE E… 7708…
#> 2 2     AANA journal          AANA J         0094-6354     "2162-… AANA J  0431…
#> 3 3     AARN news letter      AARN News Lett 0001-0197     ""      AARN N… 1251…
#> # … with abbreviated variable names ¹​`ISSN (Print)`, ²​`ISSN (Online)`

DATA

text <- "\n--------------------------------------------------------\nJrId: 1\nJournalTitle: AADE editors' journal\nMedAbbr: AADE Ed J\nISSN (Print): 0160-6999\nISSN (Online): \nIsoAbbr: AADE Ed J\nNlmId: 7708172\n--------------------------------------------------------\nJrId: 2\nJournalTitle: AANA journal\nMedAbbr: AANA J\nISSN (Print): 0094-6354\nISSN (Online): 2162-5239\nIsoAbbr: AANA J\nNlmId: 0431420\n--------------------------------------------------------\nJrId: 3\nJournalTitle: AARN news letter\nMedAbbr: AARN News Lett\nISSN (Print): 0001-0197\nISSN (Online): \nIsoAbbr: AARN News Lett\nNlmId: 1251052\n--------------------------------------------------------\n"

# Create example file
fil <- tempfile(".txt")
writeLines(text, fil)

CodePudding user response:

Using {unglue} :

items <- strsplit(data, "\n?--------------------------------------------------------\n?")[[1]][-1]
items <- gsub("\n", " ", items)
pattern <- "JrId: {JrId} JournalTitle: {JournalTitle} MedAbbr: {MedAbbr} ISSN (Print): {ISSN_Print} ISSN (Online): {ISSN_Online} IsoAbbr: {IsoAbbr} NlmId: {nNlmId}"
unglue::unglue_data(items, pattern)
#>   JrId          JournalTitle        MedAbbr ISSN_Print ISSN_Online
#> 1    1 AADE editors' journal      AADE Ed J  0160-6999            
#> 2    2          AANA journal         AANA J  0094-6354   2162-5239
#> 3    3      AARN news letter AARN News Lett  0001-0197            
#>          IsoAbbr  nNlmId
#> 1      AADE Ed J 7708172
#> 2         AANA J 0431420
#> 3 AARN News Lett 1251052
  • Related