Split columns considering only the first dot in R using separate

This is my dataframe:

df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))

I need to split it into two columns using the separate function, one called Numbers and the other called Words. I have been trying this, but it is not working:

df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")

The problem is that the fifth row has two dots, and I don't know how to handle this in the regex.

Any help?

CodePudding user response：

I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:

df %>% 
  extract(
    col = col1 ,
    into = c('Number','Words'), 
    regex = "([0-9]+)\\. (.*)")

The regular expression ([0-9]+)\\. (.*) looks first for a number (one or more digits), which goes into the first column, followed by a dot and a space (\\. ), which are discarded; everything after that goes into the second column.

The result:

# A tibble: 8 × 2
  Number Words  
  <chr>  <chr>  
1 1      word   
2 2      word   
3 3      word   
4 4      word   
5 5      N. word
6 6      word   
7 7      word   
8 8      word
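
To see what each capture group picks up, the same pattern can be tried outside the pipeline. This is only a minimal base R sketch; the vector x is an illustrative copy of two rows from the question, and the ^ and $ anchors are added here so that sub() replaces the whole string.

```r
# Illustrative input: two of the rows from the question
x <- c("1. word", "5. N. word")

# Digits, then a literal dot and a space, then the rest of the string
pattern <- "^([0-9]+)\\. (.*)$"

number <- sub(pattern, "\\1", x)  # keep only the first capture group
words  <- sub(pattern, "\\2", x)  # keep only the second capture group

print(number)
print(words)
```

Because only the first "dot plus space" is consumed by the pattern, the "N." in the fifth row stays in the second group.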

CodePudding user response：

A tidyverse approach would be to first clean the data (dropping everything between the first dot and "word") and then separate.

library(dplyr)
library(tidyr)
df %>% 
  mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl = TRUE)) %>% 
  separate(col1, into = c("Number", "Words"), sep = "\\.")

Result:

# A tibble: 8 x 2
  Number Words
  <chr>  <chr>
1 1      word 
2 2      word 
3 3      word 
4 4      word 
5 5      word 
6 6      word 
7 7      word 
8 8      word
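
The cleaning step above removes the "N." from the fifth row. If it should be kept, separate can also be asked to stop after the first split. This is a minimal sketch, assuming a tidyr version in which separate() accepts extra = "merge"; the two-row df is only an illustrative subset of the question's data.

```r
library(tibble)
library(tidyr)

# Illustrative subset of the data from the question
df <- tibble(col1 = c("1. word", "5. N. word"))

# extra = "merge" yields at most length(into) pieces, so only the
# first ". " acts as a separator and the remainder stays together
res <- separate(df, col1, into = c("Number", "Words"),
                sep = "\\. ", extra = "merge")
print(res)
```

Passing convert = TRUE as well should turn Number into an integer column, if that is preferred over character.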

CodePudding user response：

Here is an alternative using readr's parse_number and a regex:

library(dplyr)
library(readr)
df %>% 
  mutate(Numbers = parse_number(col1), .before=1) %>% 
  mutate(col1 = gsub('\\d+\\. ','',col1))

  Numbers col1   
    <dbl> <chr>  
1       1 word   
2       2 word   
3       3 word   
4       4 word   
5       5 N. word
6       6 word   
7       7 word

CodePudding user response：

I am not sure how to do this with tidyr, but the following should work with base R.

# fixed = TRUE treats the dot literally instead of as "any character"
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE)
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL

Result

> head(df)
  Numbers Words
1       1  word
2       2  word
3       3  word
4       4  word
5       5  word
6       6  word
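
The gsub call in this answer drops the "N." altogether. If it should stay in the Words column, one possible base R variant is to cut each string at its first ". " only. This is a sketch rather than part of the original answer; x and out are illustrative names.

```r
# Illustrative input: two of the rows from the question
x <- c("1. word", "5. N. word")

# regexpr() reports only the first match in each string
pos <- as.integer(regexpr(". ", x, fixed = TRUE))

out <- data.frame(
  Numbers = as.numeric(substr(x, 1, pos - 1)),  # text before the first ". "
  Words   = substr(x, pos + 2, nchar(x)),       # text after the first ". "
  stringsAsFactors = FALSE
)
print(out)
```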

CodePudding user response：

Try read.table + sub

> read.table(text = sub("\\.", ",", df$col1), sep = ",")
  V1       V2
1  1     word
2  2     word
3  3     word
4  4     word
5  5  N. word
6  6     word
7  7     word
8  8     word
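
The V2 column above keeps the space that followed the first dot. A small variation, assuming the words themselves never contain a comma, is to let read.table trim the fields and name the columns directly; x and res are illustrative names.

```r
# Illustrative input: two of the rows from the question
x <- c("1. word", "5. N. word")

# sub() replaces only the first dot; strip.white trims the leading space
res <- read.table(text = sub("\\.", ",", x), sep = ",",
                  strip.white = TRUE,
                  col.names = c("Numbers", "Words"),
                  stringsAsFactors = FALSE)
print(res)
```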