I have a series of 1D arrays of different lengths greater than 1.
I would like to find in s the numbers that appear together in more than one array, and how many arrays they appear together in.
import numpy as np
import pandas as pd
a=np.array([1,2,3])
b=np.array([])
c=np.array([2,3,4,5,6])
d=np.array([2,3,4,5,6,9,15])
e=np.array([5,6])
s=pd.Series([a,b,c,d,e])
In this example the desired outcome would be something like
{[2,3]:3, [5,6]:3, [2,3,4,5,6]:2}
The expected result does not need to be a dictionary but any structure that contains this information.
I would also have to do that for >200 series like s, so performance matters to me.
I have tried
result=s.value_counts()
but I can't figure out how to proceed.
CodePudding user response:
I think what you are missing here is a way to build the combinations of numbers present in each array, so that you can then count how many times each combination appears. To do that you can use the built-in itertools module:
from itertools import combinations
import numpy as np
a = np.array([1,2,3])
for c in combinations(a, 2):
    print(c)
>>> (1, 2)
>>> (1, 3)
>>> (2, 3)
So using this, you can then build a series for each length and check how many times each combination of length 2 happens, how many times each combination of length 3 happens and so on.
from itertools import combinations
import numpy as np
import pandas as pd
a=np.array([1,2,3])
b=np.array([])
c=np.array([2,3,4,5,6])
d=np.array([2,3,4,5,6,9,15])
e=np.array([5,6])
all_arrays = a, b, c, d, e
maxsize = max(array.size for array in all_arrays)
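for length in range(2, maxsize + 1):
    # all combinations of this length, taken from every array that is long enough
    length_N_combs = pd.Series(x for array in all_arrays for x in combinations(array, length) if array.size >= length)
    counts = length_N_combs.value_counts()
    # keep only the combinations that occur in more than one array
    print(counts[counts>1])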
From here you can format the output however you like. Note that you have to exclude arrays that are too short. I'm using a generator expression for a slight gain in efficiency, but note that this algorithm is not going to be cheap anyway, since the number of combinations grows quickly with the array length. A generator expression is a compact, one-line way to write a generator. In this case, the nested expression above is roughly equivalent to defining a generator function that yields from the iterator that combinations returns, and calling that function to build the pandas Series. Something like this will give you the same result:
def length_N_combs_generator(arrays, length):
    for array in arrays:
        # skip arrays that are too short to contain a combination of this length
        if array.size >= length:
            yield from combinations(array, length)

for length in range(2, maxsize + 1):
    s = pd.Series(length_N_combs_generator(all_arrays, length))
    counts = s.value_counts()
    print(counts[counts>1])
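If you would rather end up with a single dictionary than one printout per length, the same idea can be sketched with collections.Counter. This is only a rough sketch: shared_combinations and min_length are names I made up for illustration, and it assumes each array should count at most once per combination, so duplicates inside an array are dropped first.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def shared_combinations(arrays, min_length=2):
    """Map each combination of at least `min_length` values to the number
    of arrays containing it, keeping only those found in more than one array.
    (Hypothetical helper, not part of any library.)"""
    counts = Counter()
    for array in arrays:
        # Assumption: dedupe and sort so every array yields each combination once,
        # always in the same order.
        values = sorted(set(array.tolist()))
        for length in range(min_length, len(values) + 1):
            counts.update(combinations(values, length))
    return {combo: n for combo, n in counts.items() if n > 1}


arrays = [
    np.array([1, 2, 3]),
    np.array([]),
    np.array([2, 3, 4, 5, 6]),
    np.array([2, 3, 4, 5, 6, 9, 15]),
    np.array([5, 6]),
]
result = shared_combinations(arrays)
print(result[(2, 3)], result[(5, 6)], result[(2, 3, 4, 5, 6)])  # 3 3 2
```

The result also contains every smaller shared combination, such as (2, 4), so you may want to filter it further. The cost still grows exponentially with the array length, so this is only practical for short arrays.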
CodePudding user response:
You can use set operations:
from itertools import combinations
from collections import Counter
s2 = s.apply(frozenset).sort_values(key=lambda s: s.str.len(), ascending=False)
c = Counter(x for a,b in combinations(s2, 2) if len(x:=a&b))
# increment all values by 1
for k in c:
    c[k] += 1
dict(c)
Output:
{frozenset({2, 3, 4, 5, 6}): 2, frozenset({2, 3}): 3, frozenset({5, 6}): 3}
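Because the counts above are derived from tallying pairs of arrays, they may not equal the number of arrays in every case (three identical arrays, for example, form three pairs). A possible variant, sketched under the assumption that only pairwise intersections with at least two elements are of interest, is to take those intersections as candidates and then count directly how many arrays contain each one. count_shared_sets and min_size are hypothetical names used only for this illustration.

```python
from itertools import combinations

import numpy as np


def count_shared_sets(arrays, min_size=2):
    """For every pairwise intersection with at least `min_size` elements,
    count how many of the arrays contain it as a subset.
    (Hypothetical helper, not part of any library.)"""
    sets = [frozenset(np.asarray(arr).tolist()) for arr in arrays]
    # Candidate groups: the distinct intersections of every pair of arrays.
    candidates = {a & b for a, b in combinations(sets, 2) if len(a & b) >= min_size}
    # `cand <= s` is the subset test for (frozen)sets.
    return {cand: sum(cand <= s for s in sets) for cand in candidates}


arrays = [
    np.array([1, 2, 3]),
    np.array([]),
    np.array([2, 3, 4, 5, 6]),
    np.array([2, 3, 4, 5, 6, 9, 15]),
    np.array([5, 6]),
]
shared = count_shared_sets(arrays)
print(shared)
```

On the example data this gives the same three groups and counts as the output above. The same function should also accept a pandas Series such as s, since it only iterates over its elements.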