Sorry if this has been answered, but I couldn't quite seem to find an answer that addressed this particular issue. Here is a small sample of the data I'm using:
precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27
In the precinct_no column, how can I strip everything except for the middle six digits? I don't want the four digits in the beginning, the town names at the end, or the dashes. Just those middle six digits. I need to do this for about 3,000 rows.
CodePudding user response:
If the format is consistent across your ~3000 rows, perhaps this will work:
df <- read.table(text = "precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27",
header = TRUE, sep = ",")
df$precinct_no = gsub("[^-]*-(\\d{6})-.*", "\\1", df$precinct_no)
df
#> precinct_no newsom_count dahle_count difference
#> 1 100000 5 18 -13
#> 2 100090 567 622 -55
#> 3 100120 0 0 0
#> 4 100150 0 0 0
#> 5 105000 572 538 34
#> 6 105040 609 582 27
Created on 2022-11-30 with reprex v2.0.2
Explanation:
"[^-]*-(\\d{6})-.*"
[^-]*-
match one or more characters that aren't "-", then a single "-"
(\\d{6})
capture this 'middle' group (6 digits)
-.*"
match a single "-", followed by any number of any characters
CodePudding user response:
I like extract
for things like this. You can pull out just the middle or you can separate everything to keep all permanent info:
library(tidyverse)
#get the middle
extract(df,
precinct_no,
into = "precinct_no",
regex = ".*-(.*)-.*",
convert = TRUE)
#> precinct_no newsom_count dahle_count difference
#> 1 100000 5 18 -13
#> 2 100090 567 622 -55
#> 3 100120 0 0 0
#> 4 100150 0 0 0
#> 5 105000 572 538 34
#> 6 105040 609 582 27
#get everything
extract(df,
precinct_no,
into = c("ID", "precinct_no", "town"),
regex = "(.*)-(.*)-(.*)",
convert = TRUE)
#> ID precinct_no town newsom_count dahle_count difference
#> 1 1 100000 SAN PASQUAL 5 18 -13
#> 2 2 100090 SAN PASQUAL 567 622 -55
#> 3 3 100120 SAN PASQUAL 0 0 0
#> 4 4 100150 SAN PASQUAL 0 0 0
#> 5 5 105000 RANCHO BERNARDO 572 538 34
#> 6 6 105040 RANCHO BERNARDO 609 582 27
or if you don't like regex and you just want to split on the "-":
library(tidyverse)
separate(df,
precinct_no ,
into = c("ID", "precinct_no", "town"),
sep = "-",
convert = TRUE)
#> ID precinct_no town newsom_count dahle_count difference
#> 1 1 100000 SAN PASQUAL 5 18 -13
#> 2 2 100090 SAN PASQUAL 567 622 -55
#> 3 3 100120 SAN PASQUAL 0 0 0
#> 4 4 100150 SAN PASQUAL 0 0 0
#> 5 5 105000 RANCHO BERNARDO 572 538 34
#> 6 6 105040 RANCHO BERNARDO 609 582 27
or if you don't like regex and you want to stay in base R:
df$precinct_no <- unlist(Map(\(x)x[2], strsplit(df$precinct_no, "-")))
df
#> precinct_no newsom_count dahle_count difference
#> 1 100000 5 18 -13
#> 2 100090 567 622 -55
#> 3 100120 0 0 0
#> 4 100150 0 0 0
#> 5 105000 572 538 34
#> 6 105040 609 582 27