Regex help in R - split string after each occurrence of digit (dot) digit

Time:11-04

I have the following sample dataframe:

dat <- data.frame(date= c("Sep2020", "Oct2020", "Nov2020", "Dec2020"), 
                  txt= c("1.1 What is the Constitution?     1.2 The original charter, which replaced the Articles of Confederation      1.3 hat all States would be equal.  ", 
                         "4.4 What is the Bill of Rights?      4.5    The 9th and 10th amendments are general ",
                         "5.1  in criminal prosecution to a speedy and public    5.2  War, three amendments were ratified (1865  5.3   13. The most recent amendment, the 27th, was",
                         "6.2  the case of the proposed equal rights amendment, the Congress exten      6.3     but the proposed Amendment was never ratifie          6.4  tification deadline. The 38th State, Michig"))

I want to split the dataframe so that each occurrence of digit (dot) digit starts a new row. The final dataframe would look like this:

dat2 <-data.frame(date= c("Sep2020", "Sep2020", "Sep2020", "Oct2020", "Oct2020", "Nov2020", "Nov2020", "Nov2020", "Dec2020", "Dec2020", "Dec2020"), 
                txt= c("1.1 What is the Constitution?","1.2 The original charter, which replaced the Articles of Confederation","1.3 hat all States would be equal.  ", 
                       "4.4 What is the Bill of Rights?",      "4.5    The 9th and 10th amendments are general ",
                       "5.1  in criminal prosecution to a speedy and public",    "5.2  War, three amendments were ratified (1865",  "5.3   13. The most recent amendment, the 27th, was",
                       "6.2  the case of the proposed equal rights amendment, the Congress exten", "6.3     but the proposed Amendment was never ratifie", "6.4  tification deadline. The 38th State, Michig"))

This is what I have so far:

dat<-dat %>% 
  mutate(parsed= str_extract_all(txt, "(\\d{1}\\.\\d{1,2})")) %>% 
  unnest(parsed) 

I'm able to get the digits, but not the text between them. I'm a beginner with regular expressions, and can't quite figure out how to say that I want everything between 1.1 and 1.2, for example.

Thanks!

CodePudding user response:

We may use separate_rows() from tidyr, splitting on any run of whitespace that is followed by digits, a dot, and digits:

library(tidyr)
library(dplyr)
dat %>% 
    separate_rows(txt, sep = "\\s+(?=\\d+\\.\\d+)")
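
For readers without tidyr, the same lookahead idea can be sketched in base R. This is only an illustration, not part of the answer above: it uses the first two rows of the question's sample data, and the final trimws() call is an added assumption to drop leftover trailing whitespace.

```r
# First two rows of the question's sample data.
dat <- data.frame(
  date = c("Sep2020", "Oct2020"),
  txt = c("1.1 What is the Constitution?     1.2 The original charter, which replaced the Articles of Confederation      1.3 hat all States would be equal.  ",
          "4.4 What is the Bill of Rights?      4.5    The 9th and 10th amendments are general "),
  stringsAsFactors = FALSE
)

# Split on whitespace that is immediately followed by digits, a dot, and digits.
# perl = TRUE is required because the default regex engine has no lookahead.
parts <- strsplit(dat$txt, "\\s+(?=\\d+\\.\\d+)", perl = TRUE)

# Repeat each date once per piece and flatten the pieces into one column.
dat2 <- data.frame(
  date = rep(dat$date, lengths(parts)),
  txt = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)
print(dat2)
```

Because the lookahead consumes nothing, the section number stays at the start of each piece instead of being swallowed by the separator.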

CodePudding user response:

library(dplyr)
library(tidyr)
library(stringr)
dat %>%
  mutate(parsed = stringr::str_extract_all(txt, ".*?[^0-9](?=$|[0-9]{1}\\.[0-9]{1})")) %>%
  select(-txt) %>%
  unnest(parsed) %>%
  mutate(parsed = trimws(parsed))
# # A tibble: 11 x 2
#    date    parsed                                                                  
#    <chr>   <chr>                                                                   
#  1 Sep2020 1.1 What is the Constitution?                                           
#  2 Sep2020 1.2 The original charter, which replaced the Articles of Confederation  
#  3 Sep2020 1.3 hat all States would be equal.                                      
#  4 Oct2020 4.4 What is the Bill of Rights?                                         
#  5 Oct2020 4.5    The 9th and 10th amendments are general                          
#  6 Nov2020 5.1  in criminal prosecution to a speedy and public                     
#  7 Nov2020 5.2  War, three amendments were ratified (1865                          
#  8 Nov2020 5.3   13. The most recent amendment, the 27th, was                      
#  9 Dec2020 6.2  the case of the proposed equal rights amendment, the Congress exten
# 10 Dec2020 6.3     but the proposed Amendment was never ratifie                    
# 11 Dec2020 6.4  tification deadline. The 38th State, Michig                        

I'm using ".*?[^0-9](?=$|[0-9]{1}\\.[0-9]{1})" because its {1} quantifiers are closest to what you used, though I wonder if that's over-constraining. Unless you know that you will never see (e.g.) 1.10, I'd generally prefer something like ".*?[^0-9](?=$|[0-9]+\\.[0-9]+)".
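
To make the difference between the two patterns concrete, here is a small base R sketch. The input string is hypothetical (it is not from the question's data), and the helper extract_pieces() is an illustrative name; it uses regmatches()/gregexpr() with perl = TRUE rather than stringr, which should behave the same way for these patterns.

```r
# Hypothetical input with two-digit section numbers.
x <- "9.9 first item   10.1 second item   10.2 third item"

strict  <- ".*?[^0-9](?=$|[0-9]{1}\\.[0-9]{1})"  # exactly one digit on each side of the dot
relaxed <- ".*?[^0-9](?=$|[0-9]+\\.[0-9]+)"      # one or more digits on each side

# Illustrative helper: return all trimmed matches of `pattern` in `text`.
extract_pieces <- function(pattern, text) {
  trimws(regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]])
}

strict_pieces  <- extract_pieces(strict, x)
relaxed_pieces <- extract_pieces(relaxed, x)

print(length(strict_pieces))  # the strict pattern never splits before "10.1"
print(relaxed_pieces)
```

With this input the strict pattern returns the whole string as a single piece, because "10.1" is never preceded by a non-digit followed directly by one digit and a dot, while the relaxed pattern yields three pieces.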
