Home > Mobile >  R - Parse text string into multiple columns & extract data values
R - Parse text string into multiple columns & extract data values

Time:03-26

I have a large dataset in the form shown below:

ID Scores
1 English 3, French 7, Geography 8
2 Spanish 7, Classics 4
3 Physics 5, English 5, PE 7, Art 4

I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:

ID English French Geography Spanish Classics Physics PE Art
1 3 7 8 - - - - -
2 - - - 7 4 - - -
3 5 - - - - 5 7 4

I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:

df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
)

I have looked at multiple questions on the subject, such as Split irregular text column into multiple columns in r, but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data value?

CodePudding user response:

You can first strsplit the Scores column to split on subject-score pairs (which would be in a list), then unnest the list-column into rows. Then separate the subject-score pairs into Subject and Score columns. Finally transform the data from a "long" format to a "wide" format.

Thanks @G. Grothendieck for improving my code:)

library(tidyverse)

df %>% 
  separate_rows(Scores, sep = ", ") %>% 
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>% 
  pivot_wider(names_from = "Subject", values_from = "Score")

# A tibble: 3 × 9
     ID English French Geography Spanish Classics Physics PE    Art  
  <int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA   
2     2 NA      NA     NA        7       4        NA      NA    NA   
3     3 5       NA     NA        NA      NA       5       7     4    

CodePudding user response:

Using data.table

library(data.table)

setDT(dt)

dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")

Results

#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>

Data

dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8", 
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA, 
-3L), class = c("data.frame"))
  • Related