Ruby - array difference method which takes position into account-CodePudding

I'm trying to find the number of changed words between two sentences, by converting the sentences to arrays of words. I can't use the standard Array#- method, because that removes instances of the same value wherever it appears, but I want to keep those instances in the results.

To illustrate using arrays of numbers:

arr1 = [1, 2, 3, 4]
arr2 = [4, 1, 2, 1, 2, 3, 4, 4]
arr2 - arr1
=> [] #I want [4, 1, 2, 4]

#another example: simple reordering.  

arr3 = [1, 2, 3, 4]
arr4 = [4, 3, 2, 1]
arr4 - arr3
=> [] #I want [4, 3, 2, 1]

It's not as simple as just iterating through and seeing if arr1[i] is equal to arr2[i] for every value of i - if I did that, then in the first example it would return the whole of arr2, because the addition of the 4 at the start has pushed every value onto a new index.

What I need to do, I think, is something along the lines of 'get every set of values in arr1 and try removing that *set of values from arr2, then at the end I should be left with all of the new sets of values (ie phrases) in arr2, and that's what I'm going to count as the number of words in the edit. But, it wouldn't be every set, since that would include lots of two-value sets, it's more like "look for the longest sets you can remove".

Does anyone know if there's a relatively straightforward way to do this?

CodePudding user response：

After reading it through the question again, and looking at the comments, you're looking to remove a range from an array if it matches another array? (I'm certain that there are possible optimisations)

Code:

class Array
  def minus(other)
    found=-1
    (0..(self.length-other.length)).each do |x|
      if self[x,other.length] == other
        found = x
        break
      end
    end
    if found >= 0
      self[0, found]   self[found other.length,self.length]
    else
      self
    end
  end
end

examples of output:

# array minus itself
2.7.2 :263 > arr2.minus arr2
 => []

# your arrays
2.7.2 :264 > arr2.minus arr1
 => [4, 1, 2, 4]

2.7.2 :265 > arr4.minus arr3
 => [4, 3, 2, 1]

# an array with repeated elements
2.7.2 :275 > arr6 = [1, 3, 2, 4, 2, 4, 2, 4, 2, 4]
 => [1, 3, 2, 4, 2, 4, 2, 4, 2, 4]
2.7.2 :276 > arr5 = [4, 2, 4]
 => [4, 2, 4]

2.7.2 :277 > arr6.minus arr5
 => [1, 3, 2, 2, 4, 2, 4]

CodePudding user response：

Iterating over the elements in the first array, we immediately return the second array if the item is not found in a clone of the second. This short-circuits the rest of the method, of course.

If it is we push all proceeding elements onto a result using #shift to remove them from temp so they're not considered on the next iteration. We shift again to remove the item we were searching for. At the end we append whatever's left of temp and flatten the array.

def solve(arr1, arr2)
  temp = arr2.clone

  arr1.each_with_object([]) do |x, r|
    pos = temp.find_index(x)
    return arr2 unless pos
    r.concat(temp.shift(pos))
    temp.shift
  end   temp
end

Now:

solve(arr1, arr2)
# => [4, 1, 2, 4]

solve(arr3, arr4)
# => [4, 3, 2, 1]

solve("hello world wooble".split, "hello foo world bar baz wooble".split)
# => ["foo", "bar", "baz"]

solve("hello world wooble".split, "hello foo world bar baz".split)
# => ["hello", "foo", "world", "bar", "baz"]

CodePudding user response：

TL;DR

You're actually asking multiple questions, and there are algorithms like the Levenshtein distance or the longest common substring (LCS) problem, which is considered NP-hard. You can also find some other edit-distance algorithms if you look around with the right keywords.

Since you haven't explained what you're doing with the resulting outputs, it's hard to give you an idiomatic solution that will fit all your use cases, but these solutions (tested with Ruby 3.1.2, but should work with Ruby 2.7.6 ) should at least get you started.

Naive Matching of Array Sequences

Given the very basic example in your original first question, and making some assumptions such as:

You are looking for complete matches of arr1, not just the longest subsequence.
arr1 will always be shorter than arr2.
The subsequence could match zero or more times.

then you can use the #size of arr1 to create a sliding window over arr2, and then #select matches that contain the same value as arr1. Since arrays (unlike sets or set-like difference or union operations) are ordered, [1, 2, 3] == [3, 2, 1] #=> false, which means you'll only get matches when a sub-sequence of arr2 is an exact match for a sub-sequence in arr1 of the same length and with elements in the same ordinal positions.

Consider the following:

arr1 = [1, 2, 3, 4]
arr2 = [4, 1, 2, 1, 2, 3, 4, 4]

subseq = arr2.each_cons(arr1.size).select { _1 == arr1 }
#=> [[1, 2, 3, 4]]

The end result is an array-of-arrays, which will either be #empty? if no sub-sequences were found, or the matched sequences. You can then iterate over the matches, if any, or just do something with subseq.pop if you want to work with a single value.

You may have to validate the assumptions above, or pursue a different solution if you are looking for matching substrings of arr1 with arr2, but while a bit naive the solution matches the inputs and outputs you provided.

Matching Items by Index

This is a different problem, because you are no longer really looking at sub-sequences. Instead, you are comparing items at a given index. There are a couple of ways you might solve this quickly without resorting to complex algorithms, including #zip and #each_with_index. Consider the following examples.

Comparing by Index

The following naive solution assumes that:

Your arrays are both the same size.
You want to return any value of arr2 that isn't isn't in the same index location as arr1.

arr1 = [1, 2, 3, 4]
arr2 = [4, 3, 2, 1]

arr1.each_with_index do |elem, idx|
    arr2[idx] unless elem == arr2[idx]
end
#=> [1, 2, 3, 4]

This will of course give bad results if the two arrays are of different sizes. In that case, you might need a smarter comparison such as:

arr1 = [1, 2, 3, 4, 5]
arr2 = [4, 3, 2, 1, 0, -1]

(0...[arr1.size, arr2.size].max).map do |idx|
  arr1[idx] == arr2[idx] ? nil : arr2[idx]
end
#=> [4, 3, 2, 1, 0, -1]

which will show elements from arr2 that are different, even if there's no matching index in arr1.

When Arrays Aren't Unique

This solution return values that are different, and nil where they are the same so that you can see no word was changed at that index. For example:

arr1 = [1, 3, 2, 4]
arr2 = [4, 3, 2, 1]

arr1.each_with_index.
  map { _1 == arr2[_2] ? nil : arr2[_2] }
#=> [4, nil, nil, 1]

If you don't want nil then replace it with _1 like so:

arr1 = [1, 2, 3, 4]
arr2 = [4, 3, 2, 1]

arr1.each_with_index.
  map { _1 == arr2[_2] ? nil : arr2[_2] }
#=> [4, 3, 2, 1]

Handling Unbalanced Arrays

You also have an edge case where your arrays may not be balanced. Another somewhat naive solution is to simply #zip the arrays, return values that don't match the sub-arrays or hashed key/value pairs, and then deal with the "overhang" of unbalanced items. For example:

arr1 = [1, 2, 3, 4, 5]
arr2 = [4, 3, 2, 1, 0, -1]

changed = arr1.zip(arr2).to_h.
  map { _1 == _2 ? nil : _2 }
  .append (arr1.size - arr2.size)
  .negative? ? arr2.last : arr1.last
#=> [4, 3, 2, 1, 0, -1]

Conclusion

All of these approaches basically rely on indexing rather than set-matching. They're also all pretty fast as they use built-in objects and methods rather than hand-crafted algorithms. They may not be a comprehensive replacement for specific LCS or edit-distance algorithms, but they all solve aspects of your stated problem with repeatable and predictable results so long as you meet the foundational assumptions.

If none of these fit your exact use case, they should at least get you started with devising your own after you identify whatever edge cases may apply. Failing that, a more complex solution may be needed, but these approaches have served me well for most of the use cases I've run across in real-world code. Your mileage may certainly vary.