Home > Net >  Transforming string column to specific data.frame
Transforming string column to specific data.frame

Time:10-02

Desired Output

Need the following output

df2 <-
  data.frame(
    v1 = c(1100001, 1100002, 1100003, 1100004, 1100005)
  , v2 = c("A R", "W R", "A K", "M", "A C")
  , v3 = c("P", "G P", "G P", "P", "P")
  , v4 = c(110, 161, 129, 132, "Absent")
  , v5 = c(55, 80.5, 64.5, 66,  "Absent")
    )
df2

       v1  v2  v3     v4     v5
1 1100001 A R   P    110     55
2 1100002 W R G P    161   80.5
3 1100003 A K G P    129   64.5
4 1100004   M   P    132     66
5 1100005 A C   P Absent Absent

Here is my original data.frame.

df1 <-
  structure(list(value = c(
"1100001     A R                P             110    55", 
"1100002     W R                 G P 161    80.5", 
"1100003     A K                  G P 129    64.5", 
"1100004     M                      P             132    66",
"1100005     A C                     P             Absent    Absent"
)), row.names = c(NA, -5L), class = c("data.frame")
)

df1

                                                              value
1            1100001     A R                P             110    55
2                   1100002     W R                 G P 161    80.5
3                  1100003     A K                  G P 129    64.5
4        1100004     M                      P             132    66
5 1100005     A C                     P             Absent    Absent

Used cSplit function from splitstackshape, but could not manage to get the required output. Any points.

library(splitstackshape)
cSplit(indt = df1, splitCols = "value", sep = " ")

   value_1 value_2 value_3 value_4 value_5 value_6 value_7
1: 1100001       A       R       P     110      55      NA
2: 1100002       W       R       G       P     161    80.5
3: 1100003       A       K       G       P     129    64.5
4: 1100004       M       P     132      66      NA      NA
5: 1100005       A       P  Absent  Absent      NA      NA

CodePudding user response:

We assume that a new field starts after two or more spaces or one space followed by a digit where the digit is the start of the next field. Replace such separators with a comma and then use read.table with sep=","

df1$value |>
  gsub(pattern = "   | (?=\\d)", replacement = ",", perl = TRUE) |>
  textConnection(name = "") |>
  read.table(sep = ",")

giving this data.frame:

       V1  V2  V3     V4     V5
1 1100001 A R   P    110     55
2 1100002 W R G P    161   80.5
3 1100003 A K G P    129   64.5
4 1100004   M   P    132     66
5 1100005   A   P Absent Absent

Session Log

> df1 <-
    structure(list(value = c(
  "1100001     A R                P             110    55", 
  "1100002     W R                 G P 161    80.5", 
  "1100003     A K                  G P 129    64.5", 
  "1100004     M                      P             132    66",
  "1100005     A C                     P             Absent    Absent"
  )), row.names = c(NA, -5L), class = c("data.frame")
  )
> 
> df2 <-
    data.frame(
      v1 = c(1100001, 1100002, 1100003, 1100004, 1100005)
    , v2 = c("A R", "W R", "A K", "M", "A C")
    , v3 = c("P", "G P", "G P", "P", "P")
    , v4 = c(110, 161, 129, 132, "Absent")
    , v5 = c(55, 80.5, 64.5, 66,  "Absent")
      )
> 
> df2a <- df1$value |>
    gsub(pattern = "   | (?=\\d)", replacement = ",", perl = TRUE) |>
    textConnection(name = "") |>
    read.table(sep = ",")
> 
> all(df2 == df2a)
[1] TRUE

CodePudding user response:

1.Assuming the rule for separation are: a) more than one space b) numbers follwing a letter and space are to be split

2.We make sure all soon to be varibles are separated by more than one space (using gsub to replace one space between a letter and a number with two spaces)

3.Then we use tidyr::separate to separate the string into variables using two or more spaces as deliminiator

library(dplyr)
library(tidyr)
df1 %>% 
  mutate(value = gsub("([A-z])( )([0-9])", "\\1  \\3", value)) %>% 
  separate(value, c(paste0("v", 1:5)),"[ ]{2,}")

Returns:

       v1  v2  v3     v4     v5
1 1100001 A R   P    110     55
2 1100002 W R G P    161   80.5
3 1100003 A K G P    129   64.5
4 1100004   M   P    132     66
5 1100005   A   P Absent Absent

Edit

Regarding the new constraint that was not included in the original example (only one space between a number and a letter):

Suggested solution:

We just repeat the command for adding an extra space with "reversed" regex. So that any single space between a number and a letter will get an extra space and will in then be separated by the separate call

df1 %>% 
  mutate(value = gsub("([A-z])( )([0-9])", "\\1  \\3", value)) %>% 
  mutate(value = gsub("([0-9])( )([A-z])", "\\1  \\3", value)) %>% 
  separate(value, c(paste0("v", 1:5)),"[ ]{2,}")
  • Related