Split column into the numbers it contains with the words that follow as column names-CodePudding

My apologies for the bit hard to understand title of this question, I will try to explain what I mean with some example data. I have a text column in a data frame with rows that look like this:

2,413 European ancestry cases, 2,392 European ancestry controls, 810 African American
2,731 European ancestry cases, 10,747 European ancestry controls
8,918 European ancestry individuals, 3,947 Indian Asian ancestry individuals
175 Han Chinese ancestry cases, 175 Han Chinese ancestry controls

Preferably, I would like to turn this single column into multiple numerical columns, and have the column names be the words that follow every number. So the results from the four rows above would be:

European ancestry cases, European ancestry controls, African American, European ancestry individuals, Indian Asian ancestry individuals, Han Chinese ancestry cases, Han Chinese ancestry controls
2413, 2392, 810 , NA, NA, NA, NA
2731, 10747, NA, NA, NA, NA, NA
NA, NA, NA, 8918, 3947, NA, NA
NA, NA, NA, NA, NA, 175, 175

Any ideas on how to accomplish this in R?

CodePudding user response：

Here is one approach. First I use tstrplit to split on , , then I melt long, remove the commas from the leading number, split the leading number and the text into separate columns (called num and value), and dcast back to wide format.

library(stringr)
library(data.table)

df = setDT(df)
dcast(
  melt(df[,tstrsplit(s,", ")][, i:=.I], id="i")[
    , (c("value","num")):={value=gsub(',','',value);num=str_extract(value,"^\\d ");value=gsub("^\\d  ","",value);list(value,num)}][
      !is.na(num)],
  i~value, value.var="num"
)

Output:

   i African American European ancestry cases European ancestry controls European ancestry individuals Han Chinese ancestry cases
1: 1              810                    2413                       2392                          <NA>                       <NA>
2: 2             <NA>                    2731                      10747                          <NA>                       <NA>
3: 3             <NA>                    <NA>                       <NA>                          8918                       <NA>
4: 4             <NA>                    <NA>                       <NA>                          <NA>                        175
   Han Chinese ancestry controls Indian Asian ancestry individuals
1:                          <NA>                              <NA>
2:                          <NA>                              <NA>
3:                          <NA>                              3947
4:                           175                              <NA>

Input:

structure(list(s = c("2,413 European ancestry cases, 2,392 European ancestry controls, 810 African American", 
"2,731 European ancestry cases, 10,747 European ancestry controls", 
"8,918 European ancestry individuals, 3,947 Indian Asian ancestry individuals", 
"175 Han Chinese ancestry cases, 175 Han Chinese ancestry controls"
)), class = "data.frame", row.names = c(NA, -4L))

CodePudding user response：

str_remove_all() removes the commas in a number to avoid confusing with the comma-delimiters.
separate_rows() separates the collapsed pairs of names and values into multiple rows.
extract() separates names and values into respective columns.

library(tidyverse)

df %>%
  mutate(id = 1:n(), txt = str_remove_all(txt, '(?<=\\d),(?=\\d)')) %>%
  separate_rows(txt, sep = ',') %>%
  extract(txt, c('val', 'col'), regex = "(\\d )\\s (. )", convert = TRUE) %>%
  pivot_wider(names_from = col, values_from = val)

# # A tibble: 4 × 8
#      id `European ancestry cases` `European ance…` `African Ameri…` `European ance…` `Indian Asian …` `Han Chinese a…` `Han Chinese a…`
#   <int>                     <int>            <int>            <int>            <int>            <int>            <int>            <int>
# 1     1                      2413             2392              810               NA               NA               NA               NA
# 2     2                      2731            10747               NA               NA               NA               NA               NA
# 3     3                        NA               NA               NA             8918             3947               NA               NA
# 4     4                        NA               NA               NA               NA               NA              175              175

Data

df <- data.frame(
  txt = c("2,413 European ancestry cases, 2,392 European ancestry controls, 810 African American",
          "2,731 European ancestry cases, 10,747 European ancestry controls",
          "8,918 European ancestry individuals, 3,947 Indian Asian ancestry individuals",
          "175 Han Chinese ancestry cases, 175 Han Chinese ancestry controls")
)