Comparison of lists and removing repetitions

Time:06-02

I am trying to write a script that takes lists of different sizes as input and outputs the longest lists, dropping any shorter list whose characters are all contained in a longer one.

I have put the lists in a dataframe and use a script that loops through all the rows of the dataframe to check whether the characters of one list are all present in another list, printing the longest list if there is a match.

import pandas as pd

lists = [['a','b','g'], ['a','c','d','e','g'], ['a','b'], ['b', 'd', 'f'], ['a', 'c']]
df = pd.DataFrame(lists)

Define the number of rows:

nber_rows=len(df.index)

Looping through the dataframe to find matches between the lists:

>     listnorep = []
>     for f in range(nber_rows):
>         row1 = df.iloc[f].dropna().tolist()
>         list_intersection = []
>         for g in range(nber_rows):
>             row2 = df.iloc[g].dropna().tolist()
>             # True when every element of row1 also appears in row2
>             check = all(elem in row2 for elem in row1)
>             if check:
>                 list_intersection.append(row2)
>         if list_intersection:
>             listnorep.append(list_intersection)
>         else:
>             listnorep.append(row1)
>     listnorep

The desired output in this example is:

a b g None None
a c d e g
b d f

CodePudding user response:

You can use set operations. If any set is a strict subset (<) of another one, do not select it:

# aggregate as set (after stacking to drop the NaNs)
s = df.stack().groupby(level=0).agg(set)
# keep rows whose set is not a strict subset of any other row's set
df[[not any(a<b for b in s) for a in s]]

Output:

   0  1  2     3     4
0  a  b  g  None  None
1  a  c  d     e     g
3  b  d  f  None  None
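
For comparison, the same strict-subset filter can be sketched on the plain lists, without a dataframe. This is only an illustration of the idea, not part of the answer above; the helper name drop_contained is hypothetical.

```python
lists = [['a', 'b', 'g'], ['a', 'c', 'd', 'e', 'g'], ['a', 'b'], ['b', 'd', 'f'], ['a', 'c']]


def drop_contained(lists):
    """Keep only the lists whose elements are not a strict subset of another list's elements.

    Hypothetical helper for illustration; it mirrors the `a < b` test used on the dataframe.
    """
    sets = [set(lst) for lst in lists]
    # `a < b` is True only when a is contained in b AND a != b,
    # so comparing a set with itself never removes it.
    return [lst for lst, a in zip(lists, sets) if not any(a < b for b in sets)]


result = drop_contained(lists)
print(result)
# [['a', 'b', 'g'], ['a', 'c', 'd', 'e', 'g'], ['b', 'd', 'f']]
```

Because the comparison is strict, two lists with exactly the same elements would both be kept; deduplicating them first would be a separate step.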