My apologies for the bit hard to understand title of this question, I will try to explain what I mean with some example data. I have a text column in a data frame with rows that look like this:
2,413 European ancestry cases, 2,392 European ancestry controls, 810 African American
2,731 European ancestry cases, 10,747 European ancestry controls
8,918 European ancestry individuals, 3,947 Indian Asian ancestry individuals
175 Han Chinese ancestry cases, 175 Han Chinese ancestry controls
Preferably, I would like to turn this single column into multiple numerical columns, and have the column names be the words that follow every number. So the results from the four rows above would be:
European ancestry cases, European ancestry controls, African American, European ancestry individuals, Indian Asian ancestry individuals, Han Chinese ancestry cases, Han Chinese ancestry controls
2413, 2392, 810 , NA, NA, NA, NA
2731, 10747, NA, NA, NA, NA, NA
NA, NA, NA, 8918, 3947, NA, NA
NA, NA, NA, NA, NA, 175, 175
Any ideas on how to accomplish this in R?
CodePudding user response:
Here is one approach. First I use tstrplit
to split on ,
, then I melt
long, remove the commas from the leading number, split the leading number and the text into separate columns (called num
and value
), and dcast
back to wide format.
library(stringr)
library(data.table)
df = setDT(df)
dcast(
melt(df[,tstrsplit(s,", ")][, i:=.I], id="i")[
, (c("value","num")):={value=gsub(',','',value);num=str_extract(value,"^\\d ");value=gsub("^\\d ","",value);list(value,num)}][
!is.na(num)],
i~value, value.var="num"
)
Output:
i African American European ancestry cases European ancestry controls European ancestry individuals Han Chinese ancestry cases
1: 1 810 2413 2392 <NA> <NA>
2: 2 <NA> 2731 10747 <NA> <NA>
3: 3 <NA> <NA> <NA> 8918 <NA>
4: 4 <NA> <NA> <NA> <NA> 175
Han Chinese ancestry controls Indian Asian ancestry individuals
1: <NA> <NA>
2: <NA> <NA>
3: <NA> 3947
4: 175 <NA>
Input:
structure(list(s = c("2,413 European ancestry cases, 2,392 European ancestry controls, 810 African American",
"2,731 European ancestry cases, 10,747 European ancestry controls",
"8,918 European ancestry individuals, 3,947 Indian Asian ancestry individuals",
"175 Han Chinese ancestry cases, 175 Han Chinese ancestry controls"
)), class = "data.frame", row.names = c(NA, -4L))
CodePudding user response:
str_remove_all()
removes the commas in a number to avoid confusing with the comma-delimiters.separate_rows()
separates the collapsed pairs of names and values into multiple rows.extract()
separates names and values into respective columns.
library(tidyverse)
df %>%
mutate(id = 1:n(), txt = str_remove_all(txt, '(?<=\\d),(?=\\d)')) %>%
separate_rows(txt, sep = ',') %>%
extract(txt, c('val', 'col'), regex = "(\\d )\\s (. )", convert = TRUE) %>%
pivot_wider(names_from = col, values_from = val)
# # A tibble: 4 × 8
# id `European ancestry cases` `European ance…` `African Ameri…` `European ance…` `Indian Asian …` `Han Chinese a…` `Han Chinese a…`
# <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 2413 2392 810 NA NA NA NA
# 2 2 2731 10747 NA NA NA NA NA
# 3 3 NA NA NA 8918 3947 NA NA
# 4 4 NA NA NA NA NA 175 175
Data
df <- data.frame(
txt = c("2,413 European ancestry cases, 2,392 European ancestry controls, 810 African American",
"2,731 European ancestry cases, 10,747 European ancestry controls",
"8,918 European ancestry individuals, 3,947 Indian Asian ancestry individuals",
"175 Han Chinese ancestry cases, 175 Han Chinese ancestry controls")
)