Home > Blockchain >  Remove matching strings by multiple substrings using excel formula
Remove matching strings by multiple substrings using excel formula

Time:03-15

How to remove or replace familiar words in string, by list of familiar stubs of that words?

List contain ~40 stubs (substrings). With all of this I expect substitute hundreds matching words.

I'm interested in formula solution because I already know how to do this, coding VBA.


I play around TEXTJOIN with FILTERXML, but its not possible to use it for big stub list:

Formula 1. FILTERXML (one by one entered stub)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(A2;" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti') or contains(., 'AuTi') or contains(., 'tion') or contains(., 'mpl') or contains(., 'Mpl')or contains(., 'etc'))]"));"")

Result is good, but unfortunately is case sensitive, so one by one entered substrings for all capitalization version (Propercase, UPERCASE, lowercase, MixEdCase) generate a very big formula impossible to use because of length limitation (255 characters).

Formula 2. FILTERXML (with case translation)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<T><S>"&SUBSTITUTE(A2;" ";"</S><S>")&"</S></T>";"//S[not(contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'auti')or contains(., 'tion')or contains(., 'mpl')or contains(., 'etc'))]"));"")

Result is good apparently, but there is some strange missed capitalization matches (check cell D14 in attached image). To avoid this, I can use translation for each stub translate(., 'AUTI', 'auti'),'auti') but length limitation will not leave me to use all list.

Formula 3. FILTERXML (with LOWER-ing all string)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(LOWER(A2);" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti')or contains(., 'tion') or contains(., 'mpl') or contains(., 'etc'))]"));"")

Result is good, but output string is totally lower. And again length limitation is still there.

example image

Maybe FILTERXML is not a good way for this task, or I do something wrong. Unfortunately Xpath v1 doesn't allow things like matches()

CodePudding user response:

Very nice question, but you are stretching the capabilities of a formula made through Excel-2019 to it's very limits. Here is what I came up with considering the limitations of xpath 1.0 (no matches() nor lists/array handling:

enter image description here

Formula in B1:

=TEXTJOIN(" ",,INDEX(FILTERXML("<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>","//s"),FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,IFERROR(ROW($1:$99)&"<s>"&FILTERXML("<t><s>"&SUBSTITUTE(LOWER(A1)," ","</s><s>")&"</s></t>","//s"),""))&"</s></t></x>","//t[.//*[not(contains(., '"&TEXTJOIN("') or contains (., '",,{"auti","tion","mpl","etc"})&"'))]]")))

I have tested this against an array of 40 items of say 3-4 characters long and it did not exceed the limit of both TEXTJOIN()'s parameters nor the character limit it can handle.

It's an array-entered formula (CSE).

To summarize what this does:

  • Split your input on space (keep case-specifications intact);
  • Create a new input string using TEXTJOIN() to use another level of xpath childs to give each word an index;
  • With FILTERXML() we return each index that did not contain any of the array-elements you specified;
  • Use these numbers to retrieve all nodes from the 1st step and concat those back together;
  • Note that this will impact punctuation.

Note that this will become significantly easier with ms365's lambda's helper function REDUCE():

=REDUCE(A1,{"auti","tion","mpl","etc"},LAMBDA(a,b,TEXTJOIN(" ",,FILTERXML("<t><s>"&SUBSTITUTE(a," ","</s><s>")&"</s></t>","//s[not(contains(translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'), '"&b&"'))]"))))
  • Related