Home > Blockchain >  Stripping select characters from the beginning and end of a string in R
Stripping select characters from the beginning and end of a string in R

Time:11-30

Sorry if this has been answered, but I couldn't quite seem to find an answer that addressed this particular issue. Here is a small sample of the data I'm using:

precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27

In the precinct_no column, how can I strip everything except for the middle six digits? I don't want the four digits in the beginning, the town names at the end, or the dashes. Just those middle six digits. I need to do this for about 3,000 rows.

CodePudding user response:

If the format is consistent across your ~3000 rows, perhaps this will work:

df <- read.table(text = "precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27",
header = TRUE, sep = ",")

df$precinct_no = gsub("[^-]*-(\\d{6})-.*", "\\1", df$precinct_no)
df
#>   precinct_no newsom_count dahle_count difference
#> 1      100000            5          18        -13
#> 2      100090          567         622        -55
#> 3      100120            0           0          0
#> 4      100150            0           0          0
#> 5      105000          572         538         34
#> 6      105040          609         582         27

Created on 2022-11-30 with reprex v2.0.2

Explanation:

"[^-]*-(\\d{6})-.*"

[^-]*- match one or more characters that aren't "-", then a single "-"

(\\d{6}) capture this 'middle' group (6 digits)

-.*" match a single "-", followed by any number of any characters

CodePudding user response:

I like extract for things like this. You can pull out just the middle or you can separate everything to keep all permanent info:

library(tidyverse)


#get the middle
extract(df, 
        precinct_no, 
        into = "precinct_no", 
        regex = ".*-(.*)-.*",
        convert = TRUE)
#>   precinct_no newsom_count dahle_count difference
#> 1      100000            5          18        -13
#> 2      100090          567         622        -55
#> 3      100120            0           0          0
#> 4      100150            0           0          0
#> 5      105000          572         538         34
#> 6      105040          609         582         27


#get everything
extract(df, 
        precinct_no, 
        into = c("ID", "precinct_no", "town"), 
        regex = "(.*)-(.*)-(.*)",
        convert = TRUE)
#>   ID precinct_no            town newsom_count dahle_count difference
#> 1  1      100000     SAN PASQUAL            5          18        -13
#> 2  2      100090     SAN PASQUAL          567         622        -55
#> 3  3      100120     SAN PASQUAL            0           0          0
#> 4  4      100150     SAN PASQUAL            0           0          0
#> 5  5      105000 RANCHO BERNARDO          572         538         34
#> 6  6      105040 RANCHO BERNARDO          609         582         27

or if you don't like regex and you just want to split on the "-":

library(tidyverse)

separate(df, 
         precinct_no , 
         into = c("ID", "precinct_no", "town"), 
         sep = "-", 
         convert = TRUE)
#>   ID precinct_no            town newsom_count dahle_count difference
#> 1  1      100000     SAN PASQUAL            5          18        -13
#> 2  2      100090     SAN PASQUAL          567         622        -55
#> 3  3      100120     SAN PASQUAL            0           0          0
#> 4  4      100150     SAN PASQUAL            0           0          0
#> 5  5      105000 RANCHO BERNARDO          572         538         34
#> 6  6      105040 RANCHO BERNARDO          609         582         27

or if you don't like regex and you want to stay in base R:

df$precinct_no <- unlist(Map(\(x)x[2], strsplit(df$precinct_no, "-")))
df
#>   precinct_no newsom_count dahle_count difference
#> 1      100000            5          18        -13
#> 2      100090          567         622        -55
#> 3      100120            0           0          0
#> 4      100150            0           0          0
#> 5      105000          572         538         34
#> 6      105040          609         582         27
  • Related