I need to copy numbers between two text strings in column, "transpose" it to columns next-CodePudding

I need to copy numbers between two text strings in column, "transpose" it to columns next to the first text string and repeat through the rest of the column. Dataset has hundreds of rows.

The "alignment" (like putting NAs for correcting positions of numbers) is not necessary, transposing would be more than enough.

input:

df1 <- structure(list(column1 = c("STOUT", "18", "9341", "4", "0,2005", 
"STOUT", "1", "9341", "25", "0,2004", "STIN", "7", "9341", "0,2003", 
"OFF", "7", "L(1)", "9342", "0,2005")), class = "data.frame", row.names = c(NA, 
-19L))

> print(df)
       column1
    1    STOUT
    2       18
    3     9341
    4        4
    5   0,2005
    6    STOUT
    7        1
    8     9341
    9       25
    10  0,2004
    11    STIN
    12       7
    13    9341
    14  0,2003
    15     OFF
    16       7
    17    L(1)
    18    9342
    19  0,2005

desired output:

df2 <- structure(list(column1 = c("STOUT", "STOUT", "STIN", "OFF", "L(1)"
), column2 = c(18L, 1L, 7L, 7L, NA), column3 = c(9341L, 9341L, 
9341L, NA, 9342L), column4 = c(4L, 25L, NA, NA, NA), column5 = c(0.2005, 
0.2004, 0.2003, NA, 0.2005)), class = "data.frame", row.names = c(NA, 
-5L))

> print(df2)
  column1 column2 column3 column4 column5
1   STOUT      18    9341       4  0.2005
2   STOUT       1    9341      25  0.2004
3    STIN       7    9341      NA  0.2003
4     OFF       7      NA      NA      NA
5    L(1)      NA    9342      NA  0.2005

I was thinking along Extracting a string between other two strings in R

But did not make much progress :-/

Thanks in advance.

CodePudding user response：

Split on a regex-based logical:

(I defined the break points based on the presence of capital letters (i.e. [A-Z]); you may want to modify the pattern based on your expected break points in df1$column1.)

a <- split(df1$column1, cumsum(grepl('[A-Z]', df1$column1)))
a
$`1`
[1] "STOUT"  "18"     "9341"   "4"      "0,2005"

$`2`
[1] "STOUT"  "1"      "9341"   "25"     "0,2004"

$`3`
[1] "STIN"   "7"      "9341"   "0,2003"

$`4`
[1] "OFF" "7"  

$`5`
[1] "L(1)"   "9342"   "0,2005"

Then rbind() and fill with NA:

(plyr::rbind.fill() expects a dataframe, so I'm using lapply() to call as.data.frame() to each list element.)

library(plyr)

plyr::rbind.fill(lapply(a,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
     V1   V2     V3     V4     V5
1 STOUT   18   9341      4 0,2005
2 STOUT    1   9341     25 0,2004
3  STIN    7   9341 0,2003   <NA>
4   OFF    7   <NA>   <NA>   <NA>
5  L(1) 9342 0,2005   <NA>   <NA>

CodePudding user response：

Here is a base R option using read.csv by grepl to make it

df2 <- with(
    df1,
    read.csv(
        text = paste0(
            by(column1, cumsum(grepl("^\\D", column1)), toString),
            collapse = "\n"
        ),
        header = FALSE
    )
)

which gives

> df2
     V1   V2   V3   V4   V5   V6
1 STOUT   18 9341    4    0 2005
2 STOUT    1 9341   25    0 2004
3  STIN    7 9341    0 2003   NA
4   OFF    7   NA   NA   NA   NA
5  L(1) 9342    0 2005   NA   NA

> str(df2)
'data.frame':   5 obs. of  6 variables:
 $ V1: chr  "STOUT" "STOUT" "STIN" "OFF" ...
 $ V2: int  18 1 7 7 9342
 $ V3: int  9341 9341 9341 NA 0
 $ V4: int  4 25 0 NA 2005
 $ V5: int  0 0 2003 NA NA
 $ V6: int  2005 2004 NA NA NA

CodePudding user response：

Another possible solution, based on tidyverse:

library(tidyverse)

df1 %>% 
  group_by(aux = cumsum(str_detect(column1, "[^\\d|,]"))) %>% 
  mutate(x = rep(first(column1), n())) %>%
  filter(!str_detect(column1, "[^\\d|,]")) %>%
  mutate(name = paste0("column", 2:(n() 1))) %>% 
  ungroup %>% 
  pivot_wider(c(aux, x), values_from = column1) %>% 
  select(-aux, column1 = x)

#> # A tibble: 5 × 5
#>   column1 column2 column3 column4 column5
#>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 STOUT   18      9341    4       0,2005 
#> 2 STOUT   1       9341    25      0,2004 
#> 3 STIN    7       9341    0,2003  <NA>   
#> 4 OFF     7       <NA>    <NA>    <NA>   
#> 5 L(1)    9342    0,2005  <NA>    <NA>