Home > Software design >  Why is my loop, that has to count substrings, not working correctly?
Why is my loop, that has to count substrings, not working correctly?

Time:12-19

I have two dataframes (20000 rows) with combinations. I want to loop through df2 with df1 and count the occurence of combinations. For example the code should be able to recognize that the combination "B,D" from df1 occurces 3 times with these combinations of df2; "A,B,D" "B,C,D" "A,B,C,D". So the code should be able to look into the string and recognize that the letters "B,D" are in "A,B,C,D". I put my code below, what can I change about it so it counts the combinations right?

d1 = {'Combinations' : ['A,B', 'A,C', 'A,D', 'B,C', 'B,D', 'C,D', 'A,B,C', 'A,C,D', 'B,C,D', 'A,B,D', 'A,B,C,D',]}
df1 = pd.DataFrame(data=d1)

d2 = {'Combinations' : ['A,B,D', 'A,C,D', 'B,C,D', 'A,B,C,D', 'A,D']}
df2 = pd.DataFrame(data=d2)

counts = []

for i, row in df1.iterrows():
    values = row['Combinations'].split(',')

    count = df2.Combinations.str.contains('|'.join(values)).sum()
    counts.append(count)
    
df1['counts'] = counts

Output from loop:

   Combinations  counts
0          A,B       5
1          A,C       5
2          A,D       5
3          B,C       4
4          B,D       5
5          C,D       5
6        A,B,C       5
7        A,C,D       5
8        B,C,D       5
9        A,B,D       5
10     A,B,C,D       5

Instead I want this (the correct) output from the loop:

   Combinations  counts
0          A,B       2
1          A,C       2
2          A,D       4
3          B,C       2
4          B,D       3
5          C,D       3
6        A,B,C       1
7        A,C,D       2
8        B,C,D       2
9        A,B,D       2
10     A,B,C,D       1

CodePudding user response:

Another solution is to convert the df2.Combinations to a set and then test the elements against this set:

df1.Combinations = df1.Combinations.str.split(",")
df2.Combinations = df2.Combinations.str.split(",").apply(set)

data = []
for l in df1.Combinations:
    data.append(sum(c.issuperset(l) for c in df2.Combinations))

df1["counts"] = data

print(df1)

Prints:

    Combinations  counts
0         [A, B]       2
1         [A, C]       2
2         [A, D]       4
3         [B, C]       2
4         [B, D]       3
5         [C, D]       3
6      [A, B, C]       1
7      [A, C, D]       2
8      [B, C, D]       2
9      [A, B, D]       2
10  [A, B, C, D]       1

CodePudding user response:

you can try something like this

count = 0
for i in df1.Combinations:
   for j in df2.Combinations:
     if i in j:
        count  = 1
   print( i "appears"   str(count)   "times")
  • Related