How can I use pivot_wider function in R in presence of implicit missing values?-CodePudding

I have got some data from a website. But it has a problem. It has only one column, which contains all the data, which should be in different columns. And it also has implicit missing values. The original data is like

structure(list(original_data = c("Title1", "Authors1", "Reference1 Publication Month Date, Year", 
"Abstract1", "Title2", "Authors2", "Reference2 Publication Month Date, Year", 
"Abstract2", "Title3", "Authors3", "Reference3 Publication Month Date, Year", 
"Title4", "Authors4", "Reference4 Publication Month Date, Year", 
"Abstract1")), class = "data.frame", row.names = c(NA, -15L))

The third item doesn't have "Abstract" for it, and there is no NA also in its place.

So, I want to spread the data in different columns. The expected format would be

structure(list(Titles_Data = c("Title1", "Title2", "Title3", 
"Title4"), Authors_Data = c("Authors1", "Authors2", "Authors3", 
"Authors4"), Details_Data = c("Reference1 Publication Month Date, Year", 
"Reference2 Publication Month Date, Year", "Reference3 Publication Month Date, Year", 
"Reference4 Publication Month Date, Year"), Abstracts_Data = c("Abstract1", 
"Abstract2", NA, "Abstract4")), class = "data.frame", row.names = c(NA, 
-4L))

How can I spread this data in this situation? The real data is of much larger size, around 1,700 rows.

CodePudding user response：

You can try the following solution. I also modified my first one to account for the Abstract_data values:

df %>%
  mutate(title = gsub('^([A-Z][a-z] )(\\d )(.*)', '\\1_data', original_data), 
         id = cumsum(grepl('Title\\d ', original_data))) %>%
  pivot_wider(id_cols = id,
              names_from = title, 
              values_from = original_data) %>%
  mutate(Abstract_data = ifelse(!is.na(Abstract_data), paste0(gsub('^([A-Z][a-z] )(\\d )', '\\1', Abstract_data), id), 
                                Abstract_data))


# A tibble: 4 × 5
     id Title_data Authors_data Reference_data                          Abstract_data
  <int> <chr>      <chr>        <chr>                                   <chr>        
1     1 Title1     Authors1     Reference1 Publication Month Date, Year Abstract1    
2     2 Title2     Authors2     Reference2 Publication Month Date, Year Abstract2    
3     3 Title3     Authors3     Reference3 Publication Month Date, Year NA           
4     4 Title4     Authors4     Reference4 Publication Month Date, Year Abstract4

CodePudding user response：

Here is another option. I imagine this might fail if the missing data is not the abstract:

library(tidyverse)

df  |> 
  group_by(group = cumsum(grepl("Title", original_data)))  |>
  summarise(txt = paste(original_data, collapse = "---")) |>
  separate(txt, into = c("Titles_Data", "Authors_Data", "Details_Data", "Abstracts_Data"),
           sep = "---")
#> # A tibble: 4 x 5
#>   group Titles_Data Authors_Data Details_Data                            Abstr~1
#>   <int> <chr>       <chr>        <chr>                                   <chr>  
#> 1     1 Title1      Authors1     Reference1 Publication Month Date, Year Abstra~
#> 2     2 Title2      Authors2     Reference2 Publication Month Date, Year Abstra~
#> 3     3 Title3      Authors3     Reference3 Publication Month Date, Year <NA>   
#> 4     4 Title4      Authors4     Reference4 Publication Month Date, Year Abstra~
#> # ... with abbreviated variable name 1: Abstracts_Data

This might be more flexible:

library(tidyverse)

df |>
  extract(col = original_data, 
          into = c("cat", "grp"), 
          regex = "(\\w )(\\d )", 
          remove = FALSE) |>
  pivot_wider(names_from = cat, values_from = original_data)
#> # A tibble: 4 x 5
#>   grp   Title  Authors  Reference                               Abstract 
#>   <chr> <chr>  <chr>    <chr>                                   <chr>    
#> 1 1     Title1 Authors1 Reference1 Publication Month Date, Year Abstract1
#> 2 2     Title2 Authors2 Reference2 Publication Month Date, Year Abstract2
#> 3 3     Title3 Authors3 Reference3 Publication Month Date, Year <NA>     
#> 4 4     Title4 Authors4 Reference4 Publication Month Date, Year Abstract4