Home > OS >  How to find the difference between two lists consisted of different data types with Python
How to find the difference between two lists consisted of different data types with Python

Time:12-01

Description:

Type A:

([1, 2, 3, 4, 5, 6, 7, 8], [1, 3, 4, 5, 6, 7, 8]) ➞ 2

Type B:

([True, True, False, False, True], [False, True, False, True]) ➞ True

Type C:

(["Jane", "is", "pretty", "ugly"], ["Jane", "is", "pretty"]) ➞ "ugly"

Type D:

(["different", "types", "5", 5, [5], (5,)], ["5", "different", [5], "types", 5]) ➞ (5,)

Progress: For A/C, I would first turn them into two sets, then use set.difference() or "-" to find the difference. However, this method does not apply to B, but I can still use "Counter()" to get the repeated times of each element and then find out the difference.

Problem: Now I am stuck with D. I think whether it's "set.difference()" or "-" or "Counter()", they can work pretty well with lists consisted of the same data type, but D is consisted of string, integer, list, tuple, i.e., different data types, so how to deal with it in Python?

Next Step: To me, the most straightforward way is to find a function that can at least count the types of these elements in D, so I can compare them first before I move on to the values of these elements. Is it a good way?

CodePudding user response:

One brute-force way is to loop over items in the second list, removing the item from the first list:

a = ["different", "types", "5", 5, [5], (5,)]
b = ["5", "different", [5], "types", 5]

a_copy = a.copy()
for x in b:
    a_copy.remove(x)
print(a_copy) # [(5,)]

It assumes that b is a sublist of a (up to permutation).

CodePudding user response:

The main issue is that set elements and dictionary keys (a Counter uses a dict internally) must be hashable, but there is no guarantee that the elements in the two given lists are hashable. In your example D, there is the list [5], which is unhashable because it is mutable.

According to Python docs:

Most of Python’s immutable built-in objects are hashable; mutable containers (such as lists or dictionaries) are not; immutable containers (such as tuples and frozensets) are only hashable if their elements are hashable.

If there are no limitations on the items you can get in the input lists, this problem seems difficult without using a brute-force solution: compare each element in the first list to each element in the second.

  • If there's a match: remove the match from the second list.
  • If there's no match, add it to a third list that contains the list difference.
  • Make copies of the lists beforehand using .copy() if they need to be preserved.

At the end, move all the remaining elements in the second list to the third list and return it.

This algorithm is not very efficient: O(nm) time if the first list is size n and second list is size m. But it may be necessary given the lack of assumptions we can make about the input.


Alternatively, if we can assume that one-dimensional sub-lists are the only unhashable type we'll see in the input lists, we can circumvent the issue by converting the list into its hashable counterpart: a tuple. We can still use your Counter idea, but now we have to differentiate between (5,) and [5]. Here's my idea:

  • Instead of storing an element x in the Counter, we store the tuple (type(x), x).
  • The exception is if x is a list, in which case we store (type(x), tuple(x)).
  • (5,) stores as (<class 'tuple'>, (5,))
  • [5] stores as (<class 'list'>, (5,))

Now you can simply compare the Counters to find the difference. This is faster than the brute-force solution, but you'll have to keep adding exceptions as you allow more and more unhashable elements in the input lists (like set -> frozenset). With this method, at some point you'll have to lay down some restriction on what the inputs can contain.

  • Related