How to search for the most common sequence of 6 letters in Excel file?-CodePudding

I have a list of 50k URL's, and am looking for the most common 3/4/5/6 letters in these URLs (barring .com/.org/etc).

So if the URLs are strings.com and string2.com, it would tell me that string is the most common sequence of letters.

Is there a way to do this?

I tried =INDEX(range, MODE(MATCH(range, range, 0 ))), but it didn't work.

CodePudding user response：

The following will spill all possible consecutive 3-6 character substrings and their count in order of their respective count:

Formula in C2:

=LET(x,TOCOL(DROP(REDUCE(0,SEQUENCE(4,,3),LAMBDA(a,b,HSTACK(a,LET(c,REDUCE(0,A1:A3,LAMBDA(d,e,VSTACK(d,LET(f,MID(e,SEQUENCE(LEN(e)-b 1),b),f)))),c)))),1),2),y,UNIQUE(x),UNIQUE(SORT(HSTACK(y,MAP(y,LAMBDA(z,SUM(--(x=z))))),2,-1)))

CodePudding user response：

I am not sure to fully understand the ask. I assume you would like to find the most frequent sub-strings for a set of different sizes (6/5/4/3). Let's say your data is in column A. You would like to know the most common substrings for all possible sizes in the list.

In cell C1 we generate the expected sizes:

=TEXTSPLIT("6,5,4,3", ",")

Now from each length, we are going to find the most frequent pattern within that length from column A.

In cell C2, we are going to use the following formula:

=LET(maxLength, C1, strings, $A$2:$A$13, substr, LEFT(strings, maxLength),
  match, XMATCH(substr, UNIQUE(substr)),
  matchUX, UNIQUE(match), freq, DROP(FREQUENCY(match, matchUX), -1),
  maxFreq, MAX(freq), matchIdx, FILTER(matchUX, ISNUMBER(XMATCH(freq, maxFreq))),
  UNIQUE(FILTER(substr, ISNUMBER(XMATCH(match, matchIdx))))
)

and then expand it to the right, to consider different lengths on every column.

Here is the expected output:

You can have more than one sub-string with the highest frequency, so it can return an array instead of a single value. In the previous example, the output is just a single value, but the formula assumes more than one sub-string can have the maximum frequency.

Explanation

On each iteration (each column), we select from the name strings the corresponding substring (based on the maximum length). We use for that LEFT function.

Because substr name is an array (not a range, we cannot use any function from the COUNTIF* family). Instead, we are going to use XMATCH/FREQUENCY because both accept arrays.

The name match:

XMATCH(substr, UNIQUE(substr))

Helps to find the counts, via: XMATCH(lookup_value, lookup_array) every time the lookup_array value is found in lookup_value, it puts the same index position.

Now with the unique number of matches (matchUX), we can do the count via FREQUENCY. We don't need the last open bins (check the documentation of this function), so we drop the last row via DROP function.

Now we have the frequency (freq), and we need to find the maximum frequency. The FREQUENCY function returns the frequency for each bin (matchUX). So we are interested in all values from matchUX that has the maximum frequency (maxFreq). We use FILTER function for that. We can have more than one filtered result, i.e. the scenario where two sub-strings have the maximum frequency.

Then the name matchIdx, has all the indexes from match name, corresponding to the maximum frequency. What we need now is to find the corresponding substr values. The names match and substr have the same size, so we can use FILTER again to find such substr values via matchIdx using XMATCH function.

Finally, we use UNIQUE in case the maximum frequency was reached by the same substring, so we are interested only in unique substrings.