Home > OS >  R using regular expressions to replace nth occurrence of white space in strings
R using regular expressions to replace nth occurrence of white space in strings

Time:10-21

I'm a beginner with regular expressions and am processing data from a pdf and with R. Unfortunately R did not capture the decimal points in the data so I need to replace specific occurrences of white space with periods. I was able to figure out an admittedly heavy handed solution to this problem but suspect there is a much more efficient way to accomplish this with regular expressions. Below is some example data, the solution I am using, and the resulting data set I want to get.

ex<-c("16 7978 38 78 651 42 651 42","25 1967 8 94 225 26 225 26",
      "16 5000 8 00 132 00 132 00", "16 6125 2 00 33 23 33 23")

df<-data.frame(row=1:4,string=ex)
df

row                   string
 1  16 7978 38 78 651 42 651 42
 2  25 1967 8 94 225 26 225 26
 3  16 5000 8 00 132 00 132 00
 4  16 6125 2 00 33 23 33 23

df$tst<-stri_replace_first_fixed(df$string," ",".")
df$rate<-sapply(strsplit(df$tst," ",fixed=T),"[",1)
df$string2<-stri_replace_first_fixed(df$tst,df$rate,"")%>% trimws()
df$tst<-stri_replace_first_fixed(df$string2," ",".")
df$hours<-sapply(strsplit(df$tst," ",fixed=T),"[",1)
df$string3<-stri_replace_first_fixed(df$tst,df$hours,"")%>% trimws()
df$tst<-stri_replace_first_fixed(df$string3," ",".")
df$period_amt<-sapply(strsplit(df$tst," ",fixed=T),"[",1)
df$string4<-stri_replace_first_fixed(df$tst,df$period_amt,"")%>% trimws()
df$tst<-stri_replace_first_fixed(df$string3," ",".")
df$ytd_amt<-sapply(strsplit(df$tst," ",fixed=T),"[",1)


df<-df %>% dplyr::select(-string2,-string3,-tst,-string4)
df
 
  row                     string   rate  hours  period_amt ytd_amt
   1 16 7978 38 78 651 42 651 42 16.7978 38.78     651.42  651.42
   2  25 1967 8 94 225 26 225 26 25.1967  8.94     225.26  225.26
   3  16 5000 8 00 132 00 132 00 16.5000  8.00     132.00  132.00
   4    16 6125 2 00 33 23 33 23 16.6125  2.00      33.23   33.23

In the solution above I replace the first occurrence of white space with a period, extract the corrected number, then remove the corrected number from the string. This processing is then repeated iteratively until all values have been extracted. As I said the solution works but it seems pretty sloppy to me and would be tedious if I needed to correct and extract a large number of values from the text. Any suggestions on better ways to accomplish this in R would be greatly appreciated.

Thanks in advance!

CodePudding user response:

Here is a way. As @Peter says, it may be better addresing the cause of missing decimal points, perhaps it has to do with the pdf source.

Also it will depends of the source number format, for example if it has leading zeros, negative sign, etc.

ex<-c("16 7978 38 78 651 42 651 42","25 1967 8 94 225 26 225 26",
      "16 5000 8 00 132 00 132 00", "16 6125 2 00 33 23 33 23")

df<-data.frame(row=1:4,string=ex)

n <- do.call(rbind, strsplit(gsub("(\\d ) (\\d )","\\1.\\2",df$string ), " " ))
n <- apply(n, 2, as.numeric)
colnames(n) <- c("rate",  "hours",  "period_amt", "ytd_amt")
df<-cbind(df, n)
df
#>   row                      string    rate hours period_amt ytd_amt
#> 1   1 16 7978 38 78 651 42 651 42 16.7978 38.78     651.42  651.42
#> 2   2  25 1967 8 94 225 26 225 26 25.1967  8.94     225.26  225.26
#> 3   3  16 5000 8 00 132 00 132 00 16.5000  8.00     132.00  132.00
#> 4   4    16 6125 2 00 33 23 33 23 16.6125  2.00      33.23   33.23

CodePudding user response:

You could do the whole thing with

do.call(rbind,
        lapply(strsplit(gsub("(\\d ) (\\d )", "\\1.\\2", ex), " "), 
               as.numeric))|>
  as.data.frame() |>
  setNames(c("rate", "hours", "period_amt", "ytd_amt"))
#>      rate hours period_amt ytd_amt
#> 1 16.7978 38.78     651.42  651.42
#> 2 25.1967  8.94     225.26  225.26
#> 3 16.5000  8.00     132.00  132.00
#> 4 16.6125  2.00      33.23   33.23

Created on 2022-10-20 with reprex v2.0.2

  • Related