Check a cell contains the same count of each letter and number as an adjacent cell-CodePudding

Summary

I would like to check that a cell contains the identical number of each letter from a-z and number from 0-9, as the cell next to it. The order does not matter, but the number of characters does. For example, if a cell contained "flat 1, 32 test road", and the cell next to it contained "32, flat 1, test road", it would match, as they both contain 3 t's, 2 a's, 1 l, etc.).

Example Table

An example of the expected output is below, with the formula being populated in column C:

Data 1	Data 2	Match?
123 test road	testroad123	MATCH
FOOBAR 34	FOO,/,34BAR	MATCH
HELLOWORLD1	HELLOWORLD2	NO MATCH
FLAT4,33 STREET NAME	STREET NAME 33,FLAT 4	MATCH
12345	12345 Road	NO MATCH

Working code

Currently, I have this working by first converting the contents of cell to lowercase, then individually checking each letter from a-z, and each number from 0-9 in an IF(AND) statement. It works, but looks horrible and undoubtedly is incredibly inefficient and resource intensive. The code is as follows: =IF(AND(LEN(A1)-LEN(SUBSTITUTE(A1,"a",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"a","")),LEN(A1)-LEN(SUBSTITUTE(A1,"b",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"b","")),LEN(A1)-LEN(SUBSTITUTE(A1,"c",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"c","")),LEN(A1)-LEN(SUBSTITUTE(A1,"d",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"d","")),LEN(A1)-LEN(SUBSTITUTE(A1,"e",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"e","")),LEN(A1)-LEN(SUBSTITUTE(A1,"f",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"f","")),LEN(A1)-LEN(SUBSTITUTE(A1,"g",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"g","")),LEN(A1)-LEN(SUBSTITUTE(A1,"h",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"h","")),LEN(A1)-LEN(SUBSTITUTE(A1,"i",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"i","")),LEN(A1)-LEN(SUBSTITUTE(A1,"j",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"j","")),LEN(A1)-LEN(SUBSTITUTE(A1,"k",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"k","")),LEN(A1)-LEN(SUBSTITUTE(A1,"l",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"l","")),LEN(A1)-LEN(SUBSTITUTE(A1,"m",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"m","")),LEN(A1)-LEN(SUBSTITUTE(A1,"n",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"n","")),LEN(A1)-LEN(SUBSTITUTE(A1,"o",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"o","")),LEN(A1)-LEN(SUBSTITUTE(A1,"p",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"p","")),LEN(A1)-LEN(SUBSTITUTE(A1,"q",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"q","")),LEN(A1)-LEN(SUBSTITUTE(A1,"r",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"r","")),LEN(A1)-LEN(SUBSTITUTE(A1,"s",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"s","")),LEN(A1)-LEN(SUBSTITUTE(A1,"t",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"t","")),LEN(A1)-LEN(SUBSTITUTE(A1,"u",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"u","")),LEN(A1)-LEN(SUBSTITUTE(A1,"v",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"v","")),LEN(A1)-LEN(SUBSTITUTE(A1,"w",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"w","")),LEN(A1)-LEN(SUBSTITUTE(A1,"x",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"x","")),LEN(A1)-LEN(SUBSTITUTE(A1,"y",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"y","")),LEN(A1)-LEN(SUBSTITUTE(A1,"z",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"z","")),LEN(A1)-LEN(SUBSTITUTE(A1,"0",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"0","")),LEN(A1)-LEN(SUBSTITUTE(A1,"1",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"1","")),LEN(A1)-LEN(SUBSTITUTE(A1,"2",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"2","")),LEN(A1)-LEN(SUBSTITUTE(A1,"3",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"3","")),LEN(A1)-LEN(SUBSTITUTE(A1,"4",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"4","")),LEN(A1)-LEN(SUBSTITUTE(A1,"5",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"5","")),LEN(A1)-LEN(SUBSTITUTE(A1,"6",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"6","")),LEN(A1)-LEN(SUBSTITUTE(A1,"7",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"7","")),LEN(A1)-LEN(SUBSTITUTE(A1,"8",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"8","")),LEN(A1)-LEN(SUBSTITUTE(A1,"9",""))=LEN(B1)-LEN(SUBSTITUTE(B1,"9",""))),TRUE,FALSE)

As you can see, it's an eyesore. It works, but I will need to apply this to at least 100,000 rows of data, and I believe it will be too intensive to work reliably. The current solution is to go through each letter and make sure the count of them matches, and return TRUE if they all do.

Conclusion

I have a working solution that gives an example of what is required, but it's clunky and unreliable. I'm hoping there's a better way to utilise Excel in order to complete this task more efficiently. Thanks in advance for your time!

I am using MSO 365, Excel version 2202

CodePudding user response：

Alternative using LEN:

=AND(MMULT(LEN(A2:B2)-LEN(SUBSTITUTE(UPPER(A2:B2),MID("0123456789ABCDEFGHIJKLMNOPQRSTUVXYZ",SEQUENCE(36),1),"")),{-1;1})=0)

and copied down.

CodePudding user response：

Very nice question to work on. Here is one option that will spill results:

Formula in C2:

=BYROW(A2:B6,LAMBDA(x,COUNTA(UNIQUE(MAP(x,LAMBDA(a,CONCAT(LET(b,MID(a,SEQUENCE(LEN(a)),1),c,CODE(UPPER(b)),SORT(IFERROR(--b,IF((c>64)*(c<91),b,""))))))),1))))=1

In short:

BYROW() - Loop over each row in the dataset;
MAP() - Each iteration of the above loop will pass both values through this function to split each element in characters, check if these are numbers, if not then check against the ASCII CODE() table, if neither return empty string. SORT() these characters and CONCAT() the result back together;
COUNTA(UNIQUE()) - Combination to test if the above iteration returned two of the same values (case insensitive).

CodePudding user response：

Nice answer from @JvdV.

I decided to write a lambda, CharFreq, to split the strings into characters and used Frequency to generate the frequencies of the characters and numbers. My lambda looked like this in the Advanced Formula Environment:

=LAMBDA(string,
    LET(
        codes, SEARCH(
            MID(
                string,
                SEQUENCE(
                    LEN(
                        string
                    )
                ),
                1
            ),
            "0123456789abcdefghijklmnopqrstuvxyz"
        ),
        fcodes, FILTER(
            codes,
            ISNUMBER(codes)
        ),
        freq, FREQUENCY(
            fcodes,
            SEQUENCE(36)
        ),
        freq
    )
)

Then I could just compare the frequencies to identify any discrepancies:

=IF(SUM(--(CharFreq(A2)<>CharFreq(B2)))=0,"Match","No match")

Note

This could give the wrong result if a wildcard character like "*" or "?" occurred in one of the strings - could be fixed either by using find with lower instead of search, or by substituting out those characters.