Home > Back-end >  Finding multiple supersets and subsets for values in a column with python
Finding multiple supersets and subsets for values in a column with python

Time:09-06

I am trying to find supersets and subsets for values included in a column(here for letter column) from an excel file. The data looks like this:

id letter
1 A, B, D, E, F
2 B, C
3 B
4 D, B
5 B, D, A
6 X, Y, Z
7 X, Y
8 E, D
7 G
8 G

For e.g.

  • 'B', 'D,B', 'E,D', 'B,D,A' are subsets of 'A,B,D,E,F',
  • 'B' is a subset of 'B,C',
  • 'X,Y' is a subset of 'X,Y,Z',
  • 'G' is a subset of 'G'.

and

  • 'A,B,D,E,F', 'B,C', 'X,Y,Z' and 'G' are supersets.

I would like to show and store that relation in the separate excel files, first one includes (subsets and their supersets) second one includes supersets, First file:

id letter
1 A, B, D, E, F
5 B,D,A
8 E,D
4 D,B
3 B
2 B,C
3 B
6 X, Y, Z
7 X, Y
7 G
8 G

Second file:

id letter
1 A, B, D, E, F
2 B,C
6 X, Y, Z
7 G

CodePudding user response:

One possible solution could be using itertools.combinations and check in every combination if all elements of the one item is in the other.

To find the supersets we take the letter column and convert it to a list of tuples. Then we create all possible combinations each with two elements of that column. The line a,b = ... is to find the shorter element in that specific combination. a is always the shorter element. If every letter of a is in b and a is in list out, then we remove it from the list because it is a subset of another element. At the end, out only contains the supersets of your data. Then we only have to change the elements of the list to joined strings again and filter the df with that list to get your 2nd file (here called df2)

You need to be aware of how you split your strings in the beginning and also joining in the end. If there leading or trailing whitespaces in your data, you need to strip them, otherwise in the end the filter wouldn't match the rows.

EDIT If you want to get rid of the duplicates at the end, you just need to add .drop_duplicates(subset='letter') at the end after filtering your df2. subset needs to be defined here, since both rows with G have a different value for id, so it wouldn't be considered as duplicate.

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})

lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
print(lst)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('B',), ('D', 'B'), ('B', 'D', 'A'), ('X', 'Y', 'Z'), ('X', 'Y'), ('E', 'D')]

out = lst[:] #copy of lst

for tup1,tup2 in itertools.combinations(lst, 2):
    a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
    # e.g for a,b : (('D','B'), ('B', 'D', 'A'))
    if all(elem in b for elem in a) and a in out:
        out.remove(a)

print(out)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('X', 'Y', 'Z')]

filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)

Output:

   id         letter
0   1  A, B, D, E, F
1   2           B, C
5   6        X, Y, Z
8   9              G
  • Related