I have two dataframes (20000 rows) with combinations. I want to loop through df2 with df1 and count the occurence of combinations. For example the code should be able to recognize that the combination "B,D" from df1 occurces 3 times with these combinations of df2; "A,B,D" "B,C,D" "A,B,C,D". So the code should be able to look into the string and recognize that the letters "B,D" are in "A,B,C,D". I put my code below, what can I change about it so it counts the combinations right?
d1 = {'Combinations' : ['A,B', 'A,C', 'A,D', 'B,C', 'B,D', 'C,D', 'A,B,C', 'A,C,D', 'B,C,D', 'A,B,D', 'A,B,C,D',]}
df1 = pd.DataFrame(data=d1)
d2 = {'Combinations' : ['A,B,D', 'A,C,D', 'B,C,D', 'A,B,C,D', 'A,D']}
df2 = pd.DataFrame(data=d2)
counts = []
for i, row in df1.iterrows():
values = row['Combinations'].split(',')
count = df2.Combinations.str.contains('|'.join(values)).sum()
counts.append(count)
df1['counts'] = counts
Output from loop:
Combinations counts
0 A,B 5
1 A,C 5
2 A,D 5
3 B,C 4
4 B,D 5
5 C,D 5
6 A,B,C 5
7 A,C,D 5
8 B,C,D 5
9 A,B,D 5
10 A,B,C,D 5
Instead I want this (the correct) output from the loop:
Combinations counts
0 A,B 2
1 A,C 2
2 A,D 4
3 B,C 2
4 B,D 3
5 C,D 3
6 A,B,C 1
7 A,C,D 2
8 B,C,D 2
9 A,B,D 2
10 A,B,C,D 1
CodePudding user response:
Another solution is to convert the df2.Combinations
to a set
and then test the elements against this set:
df1.Combinations = df1.Combinations.str.split(",")
df2.Combinations = df2.Combinations.str.split(",").apply(set)
data = []
for l in df1.Combinations:
data.append(sum(c.issuperset(l) for c in df2.Combinations))
df1["counts"] = data
print(df1)
Prints:
Combinations counts
0 [A, B] 2
1 [A, C] 2
2 [A, D] 4
3 [B, C] 2
4 [B, D] 3
5 [C, D] 3
6 [A, B, C] 1
7 [A, C, D] 2
8 [B, C, D] 2
9 [A, B, D] 2
10 [A, B, C, D] 1
CodePudding user response:
you can try something like this
count = 0
for i in df1.Combinations:
for j in df2.Combinations:
if i in j:
count = 1
print( i "appears" str(count) "times")