This is my dataframe:
df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))
I need to split in two columns using separate function and rename them as Numbers
and other called Words
. Ive doing this but its not working:
df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")
The problem is that the fifth has 2 dots. And I dont know how to handle with this regarding the regex.
Any help?
CodePudding user response:
I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract
instead of separate
:
df %>%
extract(
col = col1 ,
into = c('Number','Words'),
regex = "([0-9] )\\. (.*)")
The regular expression ([0-9] )\\. (.*)
means that you are looking first for a number, that you want to put in a first column, followed by a dot and a space (\\.
) that should be discarded, and the rest should go in a second column.
The result:
# A tibble: 8 × 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
CodePudding user response:
A tidyverse
approach would be to first clean the data then separate.
df %>%
mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl=TRUE)) %>%
tidyr::separate(col1, into = c("Number", "Words"), sep="\\.")
Result:
# A tibble: 8 x 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
7 7 word
8 8 word
CodePudding user response:
Here is an alternative using readr
s parse_number
and a regex:
library(dplyr)
library(readr)
df %>%
mutate(Numbers = parse_number(col1), .before=1) %>%
mutate(col1 = gsub('\\d \\. ','',col1))
Numbers col1
<dbl> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
CodePudding user response:
I am not sure how to do this with tidyr
, but the following should work with base R
.
df$col1 <- gsub('N. ', '', df$col1)
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL
Result
> head(df)
Numbers Words
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
CodePudding user response:
Try read.table
sub
> read.table(text = sub("\\.", ",", df$col1), sep = ",")
V1 V2
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word