How to re-order strings that are separated by a comma?

patterns <- c("Athens, Greece", "New York, New York, USA", "Georgia,USA", "Southern California,    USA ")

I have a collection of strings in patterns, and I would like to focus only on those that have a single comma. For example, the string New York, New York, USA should be discarded. I tried the following regular expression to find the strings that have exactly one comma, but it matched every string.

grep(",{1}", patterns)
> [1] 1 2 3 4

My ultimate goal is to re-order these strings so that the final output looks something like this: the string after the comma comes first, the comma is removed, and excess spaces are deleted.

final_output 
> [1] "Greece Athens"  "USA Georgia"  "USA Southern California"

CodePudding user response：

Here is a regex:

patterns <- c("Athens, Greece", "New York, New York, USA", 
              "Georgia,USA", "Southern California,    USA ")

grep("^[^,]*,[^,]*$", patterns)
#> [1] 1 3 4

Created on 2022-08-21 by the reprex package (v2.0.1)

Explanation:

^[^,]* matches zero or more characters other than a comma, starting at the beginning of the string;
, matches a literal comma;
[^,]*$ matches zero or more characters other than a comma, up to the end of the string;
combined, these match strings that contain exactly one comma, with no other commas before or after it.
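
To see why the attempt in the question matched every string, a short comparison of the two patterns may help. This is only a sketch using the patterns vector from the question; grepl() is used instead of grep() so that each string gets its own TRUE/FALSE.

```r
patterns <- c("Athens, Greece", "New York, New York, USA",
              "Georgia,USA", "Southern California,    USA ")

# ",{1}" is not anchored, so it matches any string that
# contains at least one comma somewhere
unanchored <- grepl(",{1}", patterns)

# Anchoring with ^ and $ forces the whole string to be:
# non-commas, one comma, non-commas
anchored <- grepl("^[^,]*,[^,]*$", patterns)

unanchored
#> [1] TRUE TRUE TRUE TRUE
anchored
#> [1]  TRUE FALSE  TRUE  TRUE
```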

To get the matching strings rather than their positions, index patterns by grep's result or use the argument value = TRUE.

grep("^[^,]*,[^,]*$", patterns, value = TRUE)
#> [1] "Athens, Greece"               "Georgia,USA"                 
#> [3] "Southern California,    USA "

As for the second goal, here is a way. Once again, base R only.

patterns <- c("Athens, Greece", "New York, New York, USA", 
              "Georgia,USA", "Southern California,    USA ", "This That")

v <- grep("^[^,]*,[^,]*$", patterns, value = TRUE)
sapply(strsplit(v, ","), \(x) paste(trimws(rev(x)), collapse = " "))
#> [1] "Greece Athens"           "USA Georgia"            
#> [3] "USA Southern California"
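
If both steps are needed in several places, they could be wrapped in a small function. The sketch below is one possible arrangement, not part of the answer above: the name reorder_single_comma is hypothetical, and vapply() is swapped in for sapply() so that an input with no single-comma strings returns character(0) instead of an empty list.

```r
# reorder_single_comma() is a hypothetical helper name, not a base R function
reorder_single_comma <- function(x) {
  # keep only the strings that contain exactly one comma
  keep <- grep("^[^,]*,[^,]*$", x, value = TRUE)
  # split on the comma, reverse the pieces, trim each one, join with a space
  vapply(strsplit(keep, ",", fixed = TRUE),
         function(parts) paste(trimws(rev(parts)), collapse = " "),
         character(1), USE.NAMES = FALSE)
}

patterns <- c("Athens, Greece", "New York, New York, USA",
              "Georgia,USA", "Southern California,    USA ", "This That")

result <- reorder_single_comma(patterns)
result
#> [1] "Greece Athens"           "USA Georgia"
#> [3] "USA Southern California"
```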

CodePudding user response：

First, find the strings that have exactly one comma and use that to subset patterns. Then capture the characters before and after the comma in two capture groups (note the parentheses ()). Finally, replace the string with the second capture group \\2, followed by a space, then the first capture group \\1.

library(stringr)

sub("^(\\w.+?),\\s*(\\w.+?)\\s{0,}$", 
    "\\2 \\1", 
    patterns[str_count(patterns, ",") == 1])

[1] "Greece Athens"           "USA Georgia"            
[3] "USA Southern California"

CodePudding user response：

You could use gregexpr() to count how many commas are in each string.

n.comma <- sapply(gregexpr(',', patterns), \(x) sum(x > 0))
n.comma
# [1] 1 2 1 1

For your second goal:

sub('(.+)\\s*,\\s*(.+)', '\\2 \\1', trimws(patterns)[n.comma == 1])

# [1] "Greece Athens"   "USA Georgia"   "USA Southern California"

CodePudding user response：

Here's a version that avoids writing a regex. If you want to remove the strings with more than one comma, this combination of str_count and str_trim will work:

library(stringr)
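# keep strings with fewer than two commas, then split each one at the comma
res <- str_split(patterns[str_count(patterns, ",") < 2], ",", simplify = TRUE)

# trim each part before pasting so no double space is left in the middle
paste(str_trim(res[, 2]), str_trim(res[, 1]), sep = " ")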

[1] "Greece Athens"           "USA Georgia"            
[3] "USA Southern California"

CodePudding user response：

library(magrittr)  # provides the %>% pipe

gregexpr(",", patterns) %>% lapply(length) %>% unlist()

Output:

[1] 1 2 1 1

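This gives you the number of commas in each string, provided every string contains at least one comma. gregexpr() returns -1 for a string with no match, so such a string is also counted as 1.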
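
When some strings may contain no comma at all, counting via regmatches() is one way to get a 0 for them. The sketch below uses base R only and adds "This That" to the question's data as a no-comma test case; the variable names by_length and by_matches are just for illustration.

```r
patterns <- c("Athens, Greece", "New York, New York, USA",
              "Georgia,USA", "Southern California,    USA ", "This That")

hits <- gregexpr(",", patterns, fixed = TRUE)

# length of each gregexpr() element: a no-match element is the single
# value -1, so it still has length 1
by_length <- lengths(hits)

# regmatches() returns character(0) for a no-match element,
# so lengths() gives the true comma count
by_matches <- lengths(regmatches(patterns, hits))

by_length
#> [1] 1 2 1 1 1
by_matches
#> [1] 1 2 1 1 0
```

Filtering with by_matches == 1 then keeps only the strings that have exactly one comma.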