Desired Output
Need the following output
df2 <-
data.frame(
v1 = c(1100001, 1100002, 1100003, 1100004, 1100005)
, v2 = c("A R", "W R", "A K", "M", "A C")
, v3 = c("P", "G P", "G P", "P", "P")
, v4 = c(110, 161, 129, 132, "Absent")
, v5 = c(55, 80.5, 64.5, 66, "Absent")
)
df2
v1 v2 v3 v4 v5
1 1100001 A R P 110 55
2 1100002 W R G P 161 80.5
3 1100003 A K G P 129 64.5
4 1100004 M P 132 66
5 1100005 A C P Absent Absent
Here is my original data.frame
.
df1 <-
structure(list(value = c(
"1100001 A R P 110 55",
"1100002 W R G P 161 80.5",
"1100003 A K G P 129 64.5",
"1100004 M P 132 66",
"1100005 A C P Absent Absent"
)), row.names = c(NA, -5L), class = c("data.frame")
)
df1
value
1 1100001 A R P 110 55
2 1100002 W R G P 161 80.5
3 1100003 A K G P 129 64.5
4 1100004 M P 132 66
5 1100005 A C P Absent Absent
Used cSplit
function from splitstackshape
, but could not manage to get the required output. Any points.
library(splitstackshape)
cSplit(indt = df1, splitCols = "value", sep = " ")
value_1 value_2 value_3 value_4 value_5 value_6 value_7
1: 1100001 A R P 110 55 NA
2: 1100002 W R G P 161 80.5
3: 1100003 A K G P 129 64.5
4: 1100004 M P 132 66 NA NA
5: 1100005 A P Absent Absent NA NA
CodePudding user response:
We assume that a new field starts after two or more spaces or one space followed by a digit where the digit is the start of the next field. Replace such separators with a comma and then use read.table
with sep=","
df1$value |>
gsub(pattern = " | (?=\\d)", replacement = ",", perl = TRUE) |>
textConnection(name = "") |>
read.table(sep = ",")
giving this data.frame:
V1 V2 V3 V4 V5
1 1100001 A R P 110 55
2 1100002 W R G P 161 80.5
3 1100003 A K G P 129 64.5
4 1100004 M P 132 66
5 1100005 A P Absent Absent
Session Log
> df1 <-
structure(list(value = c(
"1100001 A R P 110 55",
"1100002 W R G P 161 80.5",
"1100003 A K G P 129 64.5",
"1100004 M P 132 66",
"1100005 A C P Absent Absent"
)), row.names = c(NA, -5L), class = c("data.frame")
)
>
> df2 <-
data.frame(
v1 = c(1100001, 1100002, 1100003, 1100004, 1100005)
, v2 = c("A R", "W R", "A K", "M", "A C")
, v3 = c("P", "G P", "G P", "P", "P")
, v4 = c(110, 161, 129, 132, "Absent")
, v5 = c(55, 80.5, 64.5, 66, "Absent")
)
>
> df2a <- df1$value |>
gsub(pattern = " | (?=\\d)", replacement = ",", perl = TRUE) |>
textConnection(name = "") |>
read.table(sep = ",")
>
> all(df2 == df2a)
[1] TRUE
CodePudding user response:
1.Assuming the rule for separation are: a) more than one space b) numbers follwing a letter and space are to be split
2.We make sure all soon to be varibles are separated by more than one space (using gsub
to replace one space between a letter and a number with two spaces)
3.Then we use tidyr::separate
to separate the string into variables using two or more spaces as deliminiator
library(dplyr)
library(tidyr)
df1 %>%
mutate(value = gsub("([A-z])( )([0-9])", "\\1 \\3", value)) %>%
separate(value, c(paste0("v", 1:5)),"[ ]{2,}")
Returns:
v1 v2 v3 v4 v5
1 1100001 A R P 110 55
2 1100002 W R G P 161 80.5
3 1100003 A K G P 129 64.5
4 1100004 M P 132 66
5 1100005 A P Absent Absent
Edit
Regarding the new constraint that was not included in the original example (only one space between a number and a letter):
Suggested solution:
We just repeat the command for adding an extra space with "reversed" regex. So that any single space between a number and a letter will get an extra space and will in then be separated by the separate
call
df1 %>%
mutate(value = gsub("([A-z])( )([0-9])", "\\1 \\3", value)) %>%
mutate(value = gsub("([0-9])( )([A-z])", "\\1 \\3", value)) %>%
separate(value, c(paste0("v", 1:5)),"[ ]{2,}")