Home > Net >  How can I simplify the existing formula?
How can I simplify the existing formula?

Time:10-30

I am using a very extensive formula to remove all words from a string that start with lowercase letters and contains numerous special characters and other signs and symbols. The goal is to end up with only words that start with uppercase letters (if two words start with uppercase letters right after each other, then they are counted as one word). For example

This is the input

Obama Mama told: Reporters in 19. Washington-Post in That he and Netanyahu-lll are opposed to Iran's calls for "Death To America".

And this is the expected output

Obama Mama, Reporters, Washington-Post, That, Netanyahu-lll, Irans, Death To America

And this is the formula

=Regexreplace(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(INDEX(TEXTJOIN(" "; 1; LAMBDA(x;IF(REGEXMATCH(x&"";"(^[0-9a-zäüö])");"_";x))(SPLIT(G7;" "&CHAR(10)))));"(.*)\/|\|.*|\(.*\) |\.|»| - .*$| – |!|\?|\ |\„|\“|%| \& | \& |'|»|«|""";"");"(:| --)";" _");"(^[_\s] |[\s_] $)";"");"\s_ ";",");"([,] |,\s)";",")

This also works very well. I just wonder if there isn't a simpler way that just extracts all the words with capital letters.

Is there such a thing as a multiple REGEXEXTRACT? This one is only extracting the first word:

=regexextract(G7;"\b[A-Z].*?\b")

CodePudding user response:

Matching your output exactly, I can get it down to three regex. Try this:

=regexreplace(regexreplace(regexreplace(G7,"[^A-Za-z\'\-\ ""]",""),"\ [a-z\ ] \ ",", "),"[\'""]","")

If you need the output in separate cells, use SPLIT() around the regex.

enter image description here

CodePudding user response:

Is there such a thing as a multiple REGEXEXTRACT?

If you provide multiple capture groups(()), you'll get multiple extracts.

We can adapt the technique mentioned by enter image description here

  • Related