Home > Software engineering >  Regex extract, remove duplicates, and join with a pipe in Google Sheets
Regex extract, remove duplicates, and join with a pipe in Google Sheets

Time:10-14

I have rows of data like the following:

Cat A>Subcat A|Cat A>Subcat B
Cat A>Subcat C|Cat B>Subcat A|Cat B>Subcat C|Cat C

You'll notice that it's basically a list of parent categories and subcategories separated by a pipe |

I am needing to extract data from each row in two following ways:

  1. Get all parent categories and separate them by a pipe | (removing duplicates).
  2. Get all subcategory names and separate them by a pipe | (removing duplicates).

From those first two provided rows, the result should look like:

String Parents (Result 1) Children (Result 2)
Cat A>Subcat A|Cat A>Subcat B Cat A Subcat A|Subcat B
Cat A>Subcat C|Cat B>Subcat A|Cat B>Subcat C|Cat C Cat A|Cat B|Cat C Subcat C|Subcat A

I've been able to achieve partial results using REGEXEXTRACT and JOIN but it either matches only once, or returns multiple. Example:

# Returns the first instance of "Cat A" only
=REGEXEXTRACT(H2,"(.*?)>.*?\|")

I'm looking to get help creating two regex patterns that can get the desired "Result 1" and "Result 2"

CodePudding user response:

try:

=ARRAYFORMULA(REGEXREPLACE(TRIM(SUBSTITUTE(TRANSPOSE(QUERY(QUERY(QUERY(
 IFNA(SPLIT(UNIQUE(FLATTEN(ROW(A1:A2)&"×"&
 REGEXREPLACE(REGEXEXTRACT(SPLIT(A1:A2&">", "|", 1), "(.*)>"), "(>.*)", )))&"|", "×")), 
 "select max(Col2) 
  where Col1 is not null 
  group by Col2 
  pivot Col1"), 
 "offset 1", 0),,9^9)), "| ", "|")), "\|$", ))

enter image description here

and:

=ARRAYFORMULA(REGEXREPLACE(TRIM(SUBSTITUTE(TRANSPOSE(TRIM(QUERY(QUERY(QUERY(
 IFNA(SPLIT(UNIQUE(FLATTEN(ROW(A1:A2)&"×"&
 REGEXREPLACE(REGEXEXTRACT(SPLIT(A1:A2, "|", 1), ">(.*)"), "(>.*)", )))&"|", "×")), 
 "select max(Col2) 
  where Col1 is not null 
  group by Col2 
  pivot Col1"), 
 "offset 1", 0),,9^9))), "| ", "|")), "\|$", ))

enter image description here

  • Related