I have got some data from a website. But it has a problem. It has only one column, which contains all the data, which should be in different columns. And it also has implicit missing values. The original data is like
structure(list(original_data = c("Title1", "Authors1", "Reference1 Publication Month Date, Year",
"Abstract1", "Title2", "Authors2", "Reference2 Publication Month Date, Year",
"Abstract2", "Title3", "Authors3", "Reference3 Publication Month Date, Year",
"Title4", "Authors4", "Reference4 Publication Month Date, Year",
"Abstract1")), class = "data.frame", row.names = c(NA, -15L))
The third item doesn't have "Abstract" for it, and there is no NA also in its place.
So, I want to spread the data in different columns. The expected format would be
structure(list(Titles_Data = c("Title1", "Title2", "Title3",
"Title4"), Authors_Data = c("Authors1", "Authors2", "Authors3",
"Authors4"), Details_Data = c("Reference1 Publication Month Date, Year",
"Reference2 Publication Month Date, Year", "Reference3 Publication Month Date, Year",
"Reference4 Publication Month Date, Year"), Abstracts_Data = c("Abstract1",
"Abstract2", NA, "Abstract4")), class = "data.frame", row.names = c(NA,
-4L))
How can I spread this data in this situation? The real data is of much larger size, around 1,700 rows.
CodePudding user response:
You can try the following solution. I also modified my first one to account for the Abstract_data
values:
df %>%
mutate(title = gsub('^([A-Z][a-z] )(\\d )(.*)', '\\1_data', original_data),
id = cumsum(grepl('Title\\d ', original_data))) %>%
pivot_wider(id_cols = id,
names_from = title,
values_from = original_data) %>%
mutate(Abstract_data = ifelse(!is.na(Abstract_data), paste0(gsub('^([A-Z][a-z] )(\\d )', '\\1', Abstract_data), id),
Abstract_data))
# A tibble: 4 × 5
id Title_data Authors_data Reference_data Abstract_data
<int> <chr> <chr> <chr> <chr>
1 1 Title1 Authors1 Reference1 Publication Month Date, Year Abstract1
2 2 Title2 Authors2 Reference2 Publication Month Date, Year Abstract2
3 3 Title3 Authors3 Reference3 Publication Month Date, Year NA
4 4 Title4 Authors4 Reference4 Publication Month Date, Year Abstract4
CodePudding user response:
Here is another option. I imagine this might fail if the missing data is not the abstract:
library(tidyverse)
df |>
group_by(group = cumsum(grepl("Title", original_data))) |>
summarise(txt = paste(original_data, collapse = "---")) |>
separate(txt, into = c("Titles_Data", "Authors_Data", "Details_Data", "Abstracts_Data"),
sep = "---")
#> # A tibble: 4 x 5
#> group Titles_Data Authors_Data Details_Data Abstr~1
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 Title1 Authors1 Reference1 Publication Month Date, Year Abstra~
#> 2 2 Title2 Authors2 Reference2 Publication Month Date, Year Abstra~
#> 3 3 Title3 Authors3 Reference3 Publication Month Date, Year <NA>
#> 4 4 Title4 Authors4 Reference4 Publication Month Date, Year Abstra~
#> # ... with abbreviated variable name 1: Abstracts_Data
This might be more flexible:
library(tidyverse)
df |>
extract(col = original_data,
into = c("cat", "grp"),
regex = "(\\w )(\\d )",
remove = FALSE) |>
pivot_wider(names_from = cat, values_from = original_data)
#> # A tibble: 4 x 5
#> grp Title Authors Reference Abstract
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Title1 Authors1 Reference1 Publication Month Date, Year Abstract1
#> 2 2 Title2 Authors2 Reference2 Publication Month Date, Year Abstract2
#> 3 3 Title3 Authors3 Reference3 Publication Month Date, Year <NA>
#> 4 4 Title4 Authors4 Reference4 Publication Month Date, Year Abstract4