This is hard to explain, So let me get straight into explaining the data and code.
I have a data frame with each row containing all possible combinations of the characters A, B and C. Order of the elements is not taken into consideration.
| vars_set |
| ---------|
| {b, c, a} |
| {b, c} |
| {c, a} |
| {a} |
| {b} |
| {c} |
I also have a list of sets with the same character combination but there can be duplicates here and not all possible combination will be in this list.
actual_combos=[{'a', 'b', 'c'},{'a'},{'a'},{'b'},{'b', 'c'},{'a', 'c'},{'a', 'c'}]
What I would like to do is take each row from the data frame and check how many of the above sets in actual_combos have all elements present within that specific combination of the dataframe row value.
So using the example data, the output would look like this.
| vars_set | combo_count |
| ---------| ------------|
| {b, c, a}|7|
| {b, c} |2|
| {c, a} |4|
| {b, a} |3|
| {a} |2|
| {b} |1|
| {c} |0|
All sets from the list in actual combos are present in (b, c, a) so the count is 7.
The count for (c) is 0 since none of the sets in actual_combos have the value C only. I think you get the point...
Now, my real data set is much larger and has 17 unique elements and not just 3 like in the example. I actually have a working piece of code, but the issue is it's quite slow.
So I'm looking for some help with optimizing my code, so it can run faster. Here is my full code.
import pandas as pd
import numpy as np
from itertools import combinations
input = ['a','b','c']
output = sum([list(map(list, combinations(input, i))) for i in range(len(input) 1)], [])
df_all_combs = pd.DataFrame(output)
df_all_combs['vars_ls']=df_all_combs.values.tolist()
df_all_combs['vars_ls'] = [[x for x in inner_list if x is not None] for inner_list in df_all_combs['vars_ls']]
df_all_combs['combo_count']=""
df_all_combs['vars_set'] = df_all_combs['vars_ls'].apply(set)
actual_combos=[{'a', 'b', 'c'},{'a'},{'a'},{'b'},{'b', 'c'},{'a', 'c'},{'a', 'c'}]
for i in range(df_all_combs.shape[0]):
count=0
for item in actual_combos:
if len(item-df_all_combs['vars_set'].iloc[i]) == 0:
count=count 1
df_all_combs['combo_count'].iloc[i] = count
df_all_combs
CodePudding user response:
I only slightly changed your data creation part, as I assume that it's not the main interest here. Just two little remarks: Please don't overwrite input
, it's a Python built-in. And you don't have to initialize columns like this df_all_combs['combo_count']=""
, create them when you need them.
I think the interesting part is where you compare the sets to each other. Generally, you should avoid looping over the rows of a dataframe, it's slow. I got rid of the for loop by writing a small function (get_combo_count
) and applying it to the column 'vars_set'
(using the pandas.Series.map
method).
When you look inside the function, you can see that I also got rid of the second for loop and instead apply my own little lambda
function using Python's map
function. The output of this function is a list-like containing booleans, which can be passed to the sum
function, thus avoiding the need of a count
variable.
On my machine, the revised code runs about 4-5 times faster than the original code.
import pandas as pd
import numpy as np
from itertools import combinations
my_input = ['a','b','c']
output = sum([list(map(list, combinations(my_input , i))) for i in range(len(my_input ) 1)], [])
df_all_combs = pd.DataFrame(output)
df_all_combs['vars_ls']=df_all_combs.values.tolist()
df_all_combs['vars_ls'] = [[x for x in inner_list if x is not None] for inner_list in df_all_combs['vars_ls']]
df_all_combs['vars_set'] = df_all_combs['vars_ls'].apply(set)
actual_combos=[{'a', 'b', 'c'},{'a'},{'a'},{'b'},{'b', 'c'},{'a', 'c'},{'a', 'c'}]
def get_combo_count(row):
return sum(map(lambda x: x.issubset(row), actual_combos))
df_all_combs['combo_count'] = df_all_combs['vars_set'].map(get_combo_count)
df_all_combs
CodePudding user response:
Using the powerset recipe from itertools
and set.issubset()
:
counts = {}
for input_psi in powerset(inputs):
counts[input_psi] = sum(item.issubset(input_psi) for item in actual_combos)
This seems ~300x faster (tested up to 20 characters), but this may vary based on actual input.
import pandas as pd
# Helpers
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
# From: https://docs.python.org/3/library/itertools.html#itertools-recipes
from itertools import chain, combinations
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s) 1))
# Inputs
inputs = ['a','b','c']
actual_combos = [{'a', 'b', 'c'},{'a'},{'a'},{'b'},{'b', 'c'},{'a', 'c'},{'a', 'c'}]
# Build Dataframe
counts = {}
for input_psi in powerset(inputs):
counts[input_psi] = sum(item.issubset(input_psi) for item in actual_combos)
df = pd.DataFrame(
[(set(k), v) for (k,v) in counts.items()],
columns=['vars_set', 'combo_count']
)
print(df)