How to remove or replace familiar words in string, by list of familiar stubs of that words?
List contain ~40 stubs (substrings). With all of this I expect substitute hundreds matching words.
I'm interested in formula solution because I already know how to do this, coding VBA.
I play around TEXTJOIN with FILTERXML, but its not possible to use it for big stub list:
Formula 1. FILTERXML (one by one entered stub)
=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(A2;" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti') or contains(., 'AuTi') or contains(., 'tion') or contains(., 'mpl') or contains(., 'Mpl')or contains(., 'etc'))]"));"")
Result is good, but unfortunately is case sensitive, so one by one entered substrings for all capitalization version (Propercase, UPERCASE, lowercase, MixEdCase) generate a very big formula impossible to use because of length limitation (255 characters).
Formula 2. FILTERXML (with case translation)
=IFERROR(TEXTJOIN(" ";;FILTERXML("<T><S>"&SUBSTITUTE(A2;" ";"</S><S>")&"</S></T>";"//S[not(contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'auti')or contains(., 'tion')or contains(., 'mpl')or contains(., 'etc'))]"));"")
Result is good apparently, but there is some strange missed capitalization matches (check cell D14 in attached image). To avoid this, I can use translation for each stub translate(., 'AUTI', 'auti'),'auti')
but length limitation will not leave me to use all list.
Formula 3. FILTERXML (with LOWER-ing all string)
=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(LOWER(A2);" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti')or contains(., 'tion') or contains(., 'mpl') or contains(., 'etc'))]"));"")
Result is good, but output string is totally lower. And again length limitation is still there.
Maybe FILTERXML is not a good way for this task, or I do something wrong. Unfortunately Xpath v1 doesn't allow things like matches()
CodePudding user response:
Very nice question, but you are stretching the capabilities of a formula made through Excel-2019 to it's very limits. Here is what I came up with considering the limitations of xpath 1.0 (no matches()
nor lists/array handling:
Formula in B1
:
=TEXTJOIN(" ",,INDEX(FILTERXML("<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>","//s"),FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,IFERROR(ROW($1:$99)&"<s>"&FILTERXML("<t><s>"&SUBSTITUTE(LOWER(A1)," ","</s><s>")&"</s></t>","//s"),""))&"</s></t></x>","//t[.//*[not(contains(., '"&TEXTJOIN("') or contains (., '",,{"auti","tion","mpl","etc"})&"'))]]")))
I have tested this against an array of 40 items of say 3-4 characters long and it did not exceed the limit of both TEXTJOIN()
's parameters nor the character limit it can handle.
It's an array-entered formula (CSE).
To summarize what this does:
- Split your input on space (keep case-specifications intact);
- Create a new input string using
TEXTJOIN()
to use another level of xpath childs to give each word an index; - With
FILTERXML()
we return each index that did not contain any of the array-elements you specified; - Use these numbers to retrieve all nodes from the 1st step and concat those back together;
- Note that this will impact punctuation.
Note that this will become significantly easier with ms365's lambda's helper function REDUCE()
:
=REDUCE(A1,{"auti","tion","mpl","etc"},LAMBDA(a,b,TEXTJOIN(" ",,FILTERXML("<t><s>"&SUBSTITUTE(a," ","</s><s>")&"</s></t>","//s[not(contains(translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'), '"&b&"'))]"))))