Regex for replacing commas only within brackets

Time:09-16

I have an ingredient dataset in which each row is a comma-separated list of ingredients, for example:

Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour

I want to parse the file to replace only commas within brackets with a semicolon. There can be any number of brackets and any number of commas within the brackets. The result should look like this:

Oats (24%) (Rolled;Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ;Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour

Can I get some help on regex that will solve the problem? Thank you in advance.

CodePudding user response:

1) gsubfn. This can be done without complex regular expressions using gsubfn. The regular expression "." matches any single character, so fun is run once for each character of the input, with that character passed to it via the x argument. this$k within fun refers to a counter that starts at 0, is incremented by 1 each time a ( is encountered, and is decremented by 1 each time a ) is encountered. If the counter is not zero and a comma is encountered, a semicolon is returned to replace the comma; otherwise, the input character is returned unchanged.

library(gsubfn)

p <- proto(k = 0, fun = function(this, x) {
  if (x == "(") this$k <- k + 1
  if (x == ")") this$k <- k - 1
  if (k && x == ",") ";" else x
})
gsubfn(".", p, s)

giving:

[1] "Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour"

2) Base R. A base R solution is to split the input into single characters, giving the character vector chars, and then create a counter vector k of the same length as chars. At each position, k holds the number of ( seen so far minus the number of ) seen so far, i.e. the current nesting depth. Then replace the commas that correspond to a nonzero k with semicolons and paste chars back into a single string.

chars <- strsplit(s, "")[[1]]
k <- cumsum((chars == "(") - (chars == ")"))
chars[k & chars == ","] <- ";"
paste(chars, collapse = "")
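Since the question mentions a dataset with one ingredient list per row, the same counter idea can be wrapped in a small function and applied to a whole character vector. This is only a sketch: the function name replace_bracket_commas and the short sample rows are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (name is an assumption): replaces every comma that
# sits at bracket depth > 0 with a semicolon, one input string at a time.
replace_bracket_commas <- function(x) {
  vapply(x, function(row) {
    chars <- strsplit(row, "")[[1]]
    # Nesting depth at each character: +1 at "(", -1 at ")"
    depth <- cumsum((chars == "(") - (chars == ")"))
    chars[depth > 0 & chars == ","] <- ";"
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up sample rows, not taken from the original dataset
rows <- c("a, b (c, d (e, f)), g", "no brackets, here", "x (1, 2), y (3)")
out <- replace_bracket_commas(rows)
print(out)
```

With a data frame, this could be applied to the ingredient column directly, e.g. df$ingredients <- replace_bracket_commas(df$ingredients), where df and ingredients are placeholder names.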

Note

Input string s is the following.

s <- "Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour"

CodePudding user response:

You can use a recursive pattern with (?R), like this:

i <- gregexpr("\\(([^()]|(?R))*\\)", s, perl=TRUE)
regmatches(s, i)[[1]] <- gsub(",", ";", regmatches(s, i)[[1]])

s
#[1] "Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour"

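Here (?R) recursively applies the whole pattern at that point. For example, a(?R)?z matches one or more letters a followed by exactly the same number of letters z. In the pattern above, the recursion lets each outermost ( ... ) group be matched together with any balanced groups nested inside it.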
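To see what the recursive pattern actually captures, it may help to run it on a short string first. The input below is made up for illustration; the pattern is the one used in the answer.

```r
# Made-up input, not taken from the original dataset
x <- "a, b (c, d (e, f)), g (h, i)"

# "(" followed by any mix of non-bracket characters or nested balanced
# groups (matched by recursing into the whole pattern), then ")"
pattern <- "\\(([^()]|(?R))*\\)"
m <- gregexpr(pattern, x, perl = TRUE)

# Each element is one outermost balanced group, nested groups included
groups <- regmatches(x, m)[[1]]
print(groups)

# Replace commas only inside the matched groups and write them back
regmatches(x, m)[[1]] <- gsub(",", ";", groups, fixed = TRUE)
print(x)
```

One difference from the counter-based approaches: if a string contains an opening ( with no matching ), this pattern does not match that bracket and leaves the commas after it alone, whereas a running counter treats everything after the unmatched ( as being inside brackets.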

Data

s <- "Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour"