Home > Enterprise >  Highlight near duplicate in conditional formating to highlight values with one character difference
Highlight near duplicate in conditional formating to highlight values with one character difference

Time:01-25

I'm currently using this formula to highlight duplicates in my spreadsheet.

=ARRAYFORMULA(COUNTIF(A$2:$A2,$A2)>1)

Quite simple, it allows me to skip the first occurrence and only highlight 2nd, 3rd, ... occurrences.

I would like the formula to go a bit further and highlight near duplicates as well. Meaning if there is only one character difference between 2 cells, then it should be considered as a duplicate.

For instance: "Marketing", "Marketng", "Marketingg" and "Market ing" would all be considered the same.

I've made a sample sheet in case my requirement is not straightforward to understand.

Thanks in advance.

CodePudding user response:

Answer

Unfortunately, it is not possible to do this only through Formulas. Apps Scripts are need as well. The process for achieving your desired results is described below.

In Google Sheets, go to Extensions > Apps Script, paste the following code1 and save.

function TypoFinder(range, word) { // created by https://stackoverflow.com/users/19361936
  if (!Array.isArray(range) || word == "") {
    return false;
  }
  distances = range.map(row => row.map(cell => Levenshtein(cell, word))) // Iterate over range and check Levenshtein distance.
  var accumulator = 0;
  for (var i = 0; i < distances.length; i  ) {
    if (distances[i] < 2) {
      accumulator  
    } // Keep track of how many times there's a Levenshtein distance of 0 or 1.
  }
  return accumulator > 1;
}

function Levenshtein(a, b) { // created by https://stackoverflow.com/users/4269081
  if (a.length == 0) return b.length;
  if (b.length == 0) return a.length;

  // swap to save some memory O(min(a,b)) instead of O(a)
  if (a.length > b.length) {
    var tmp = a;
    a = b;
    b = tmp;
  }

  var row = [];
  // init the row
  for (var i = 0; i <= a.length; i  ) {
    row[i] = i;
  }

  // fill in the rest
  for (var i = 0; i < b.length; i  ) {
    var prev = i;
    for (var j = 0; j < a.length; j  ) {
      var val;
      if (b.charAt(i) == a.charAt(j)) {
        val = row[j]; // match
      } else {
        val = Math.min(row[j]   1, // substitution
          prev   1, // insertion
          row[j   1]   1); // deletion
      }
      row[j] = prev;
      prev = val;
    }
    row[a.length] = prev;
  }

  return row[a.length];
}

In cell B1, enter =TypoFinder($A$2:$A2,$A2). Autofill that formula down the column by draggin.

Create a conditional formatting rule for column A. Using Format Rules > Custom Formula, enter =B2:B.

At this point, you might wish to hide column B. To do so, right click on the column and press Hide Column.

The above explanation assumes the column you wish to highlight is Column A and the helper column is column B. Adjust appropriately.

Note that I have assumed you do not wish to highlight repeated blank columns as duplicate. If I am incorrect, remove || word == "" from line 2 of the provided snippet.

Explanation

The concept you have described is called Levenshtein Distance, which is a measure of how close together two strings are. There is no built-in way for Google Sheets to process this, so the Levenshtein() portion of the snippet above implements a custom function to do so instead. Then the TypoFinder() function is built on top of it, providing a method for evaluating a range of data against a specified "correct" word (looking for typos anywhere in the range).

Next, a helper column is used because Sheets has difficulties parsing custom formulas as part of a conditional formatting rule. Finally, the rule itself is implemented to check the helper column's determination of whether the row should be highlighted or not. Altogether, this highlights near-duplicate results in a specified column.


1 Adapted from duality's answer to a related question.

  • Related