patterns <- c("Athens, Greece", "New York, New York, USA", "Georgia,USA", "Southern California, USA ")
I have a collection of strings in patterns, and I would like to focus only on those that contain a single comma. For example, the string "New York, New York, USA" should be discarded. I tried the following regular expression to find the strings that have exactly one comma, but it didn't work.
grep(",{1}", patterns)
> [1] 1 2 3 4
My ultimate goal is to reorder these strings so that the part after the comma comes first, the comma is removed, and excess spaces are deleted. The final output should look something like this:
final_output
> [1] "Greece Athens" "USA Georgia" "USA Southern California"
CodePudding user response:
Here is a regex:
patterns <- c("Athens, Greece", "New York, New York, USA",
"Georgia,USA", "Southern California, USA ")
grep("^[^,]*,[^,]*$", patterns)
#> [1] 1 3 4
Created on 2022-08-21 by the reprex package (v2.0.1)
Explanation:
- ^[^,]* matches any characters other than a comma, starting at the beginning of the string;
- , matches a literal comma;
- [^,]*$ matches anything but a comma until the end of the string;
- combined, these match exactly one comma, with no other commas before or after it.
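To see the anchoring at work, here is a minimal sketch on a few made-up strings (the vector x is hypothetical, not part of the question). It uses grepl, which returns one logical value per string instead of indices.

```r
# Hypothetical test strings: zero, one, and two commas.
x <- c("no comma", "one, comma", "two, commas, here")

# TRUE only where the whole string contains exactly one comma.
one_comma <- grepl("^[^,]*,[^,]*$", x)
one_comma
#> [1] FALSE  TRUE FALSE
```

Without the ^ and $ anchors the pattern could match a substring around any single comma, so the two-comma string would also pass.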
Index patterns by grep's result, or use the argument value = TRUE to return the matching strings directly.
grep("^[^,]*,[^,]*$", patterns, value = TRUE)
#> [1] "Athens, Greece" "Georgia,USA"
#> [3] "Southern California, USA "
As for the second goal, here is a way. Once again, base R only.
patterns <- c("Athens, Greece", "New York, New York, USA",
"Georgia,USA", "Southern California, USA ", "This That")
v <- grep("^[^,]*,[^,]*$", patterns, value = TRUE)
sapply(strsplit(v, ","), \(x) paste(trimws(rev(x)), collapse = " "))
#> [1] "Greece Athens" "USA Georgia"
#> [3] "USA Southern California"
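If this is needed in more than one place, the two steps could be wrapped in a small helper. The following is only a sketch; the function name reorder_single_comma is made up for illustration, and it assumes the same one-comma pattern as above.

```r
# Hypothetical helper: keep strings with exactly one comma,
# then swap the two parts and trim surrounding whitespace.
reorder_single_comma <- function(x) {
  keep <- x[grepl("^[^,]*,[^,]*$", x)]
  parts <- strsplit(keep, ",", fixed = TRUE)
  vapply(parts,
         function(p) paste(trimws(rev(p)), collapse = " "),
         character(1))
}

patterns <- c("Athens, Greece", "New York, New York, USA",
              "Georgia,USA", "Southern California, USA ", "This That")
result <- reorder_single_comma(patterns)
result
#> [1] "Greece Athens"           "USA Georgia"
#> [3] "USA Southern California"
```

vapply is used instead of sapply so the result is always a character vector, even when no string qualifies.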
CodePudding user response:
First, find the strings that have exactly one comma and use that to extract the relevant strings from patterns. Then capture the characters before and after the comma in two capture groups (note the parentheses ()). Finally, replace the string with the second capture group \\2, followed by a space, then the first capture group \\1.
library(stringr)
sub("^(\\w.+?),\\s*(\\w.+?)\\s{0,}$",
"\\2 \\1",
patterns[str_count(patterns, ",") == 1])
[1] "Greece Athens" "USA Georgia"
[3] "USA Southern California"
CodePudding user response:
You could use gregexpr() to count how many commas each string contains.
n.comma <- sapply(gregexpr(',', patterns), \(x) sum(x > 0))
n.comma
# [1] 1 2 1 1
For your second goal:
sub('(.+)\\s*,\\s*(.+)', '\\2 \\1', trimws(patterns)[n.comma == 1])
# [1] "Greece Athens" "USA Georgia" "USA Southern California"
CodePudding user response:
Here's a regex-free version. If you want to remove the strings with more than one comma, then this combination of str_count and str_trim will work:
library(stringr)
res <- str_split(patterns[str_count(patterns, ",") < 2], ",", simplify = TRUE)
str_trim(paste(res[,2], res[,1], sep=" "))
[1] "Greece Athens" "USA Georgia"
[3] "USA Southern California"
CodePudding user response:
library(magrittr)  # provides the %>% pipe
gregexpr(",", patterns) %>% lapply(length) %>% unlist()
Output:
[1] 1 2 1 1
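This gives you the number of commas in each string. Note that a string with no commas also yields 1, because gregexpr() returns -1 (a vector of length 1) when there is no match.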
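Because gregexpr() returns -1 when nothing matches, taking the length of its result reports 1 for a string without commas. One possible way around that, sketched here in base R, is to count the extracted matches with regmatches() and lengths(). The extra string "This That" is added only to show the zero case.

```r
patterns <- c("Athens, Greece", "New York, New York, USA",
              "Georgia,USA", "Southern California, USA ", "This That")

# regmatches() returns character(0) for strings with no match,
# so lengths() gives 0 there instead of 1.
n_comma <- lengths(regmatches(patterns, gregexpr(",", patterns, fixed = TRUE)))
n_comma
#> [1] 1 2 1 1 0
```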