how to find how many common features that two columns share in a data frame in Python


Suppose I have the following dataframe where the first column is a facility and the second column is a product produced by the facility.

import numpy as np
import pandas as pd

d = {'Facility': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
     'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)
df

  Facility Product
0        A      P1
1        A      P2
2        A      P3
3        B      P1
4        B      P2
5        B      P4
6        C      P1
7        C      P3
8        C      P5

My goal is to figure out how many of the same products each pair of facilities produces. For instance, in the given example, Facility A and Facility B both produce P1 and P2, so I would like to get two as the result for that pair.

I have written the following code snippet; however, its execution is pretty slow. I was wondering if someone could show me a better way to achieve this.

# get all the products
products = df['Product'].unique()
# create the pair dataframe
colNames = ['Fac1', 'Fac2', 'Count']
linkedFacilities = pd.DataFrame(columns=colNames)
for p in products:
    # sub-dataframe with the rows for the current product
    # (exact match; str.contains(p) would also match e.g. 'P1' inside 'P10')
    myDf = df[df['Product'] == p]
    # all ordered pairs of distinct facilities producing this product
    myList = [(f1, f2)
              for f1 in myDf.Facility
              for f2 in myDf.Facility if f1 != f2]
    temp = pd.DataFrame(data=myList, columns=colNames[:2])
    # one row per facility pair and product
    temp['Count'] = 1
    linkedFacilities = pd.concat([linkedFacilities, temp], ignore_index=True)

Then I aggregate the rows and eliminate the duplicate (mirrored) pairs.

grouped = linkedFacilities.groupby(["Fac1", "Fac2"]).size().reset_index()
# the size column comes back named 0; rename it (columns=, not index=, and assign the result)
grouped = grouped.rename(columns={0: "Count"})
# mark mirrored pairs such as (B, A) vs. (A, B) as duplicates
m = pd.DataFrame(np.sort(grouped[['Fac1', 'Fac2']], axis=1), index=grouped.index).duplicated()
grouped = grouped[~m]
grouped

  Fac1 Fac2  Count
0    A    B      2
1    A    C      2
3    B    C      1
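
For comparison, here is a fully vectorized sketch that avoids the Python loop entirely (not from the original post; pd.crosstab and a matrix product are standard pandas/NumPy tools, and the variable names are illustrative). A facility-by-product indicator matrix, multiplied by its own transpose, yields the shared-product count for every facility pair in one step:

# facility-by-product indicator matrix: 1 if the facility makes the product
ct = pd.crosstab(df['Facility'], df['Product'])
# co.loc[f1, f2] = number of products both f1 and f2 produce
co = ct @ ct.T
# keep each unordered pair once (upper triangle, diagonal excluded)
iu0, iu1 = np.triu_indices(len(co), k=1)
pairs = pd.DataFrame({'Fac1': co.index[iu0],
                      'Fac2': co.columns[iu1],
                      'Count': co.to_numpy()[iu0, iu1]})

which, for the example data, should give:

  Fac1 Fac2  Count
0    A    B      2
1    A    C      2
2    B    C      1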

CodePudding user response:

Here's one option using merge and value_counts, then removing duplicate (mirrored) pairs by mapping the index to frozensets and filtering with duplicated:

counts = (df.merge(df, on='Product', suffixes=('_1', '_2'))
          .query('Facility_1!=Facility_2')
          .value_counts(['Facility_1','Facility_2']))
counts = (counts.reset_index(name='Number of shared products')
          .loc[pd.Series(counts.index.map(frozenset)).duplicated(keep='last')]
          .reset_index(drop=True))

Output:

  Facility_1 Facility_2  Number of shared products
0          A          B                          2
1          A          C                          2
2          B          C                          1
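
For intuition, the self-merge pairs each row with every row that shares its Product, so after dropping the self-pairs, each remaining row is one (facility, facility, shared product) combination, and counting rows per facility pair gives the result. A peek at the intermediate frame (for the example data, this is what pandas should print):

merged = (df.merge(df, on='Product', suffixes=('_1', '_2'))
          .query('Facility_1 != Facility_2'))
print(merged.head())

  Facility_1 Product Facility_2
1          A      P1          B
2          A      P1          C
4          A      P2          B
6          A      P3          C
7          B      P1          A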

CodePudding user response:

Here is an approach using a grouped aggregation:

from itertools import combinations

(df
 .groupby('Product')['Facility']
 .agg(lambda x: list(map(frozenset, combinations(set(x), r=2))))
 .explode()
 .value_counts()
 )

Output:

(A, C)    2
(A, B)    2
(C, B)    1
Name: Facility, dtype: int64

Alternatively, for a consistent ordering of the facilities (frozensets are unordered, so the pair display above may vary):

from itertools import combinations

out = (df
 .groupby('Product')['Facility']
 .agg(lambda x: list(combinations(sorted(set(x)), r=2)))
 .explode()
 .value_counts()
 )

Output:

(A, B)    2
(A, C)    2
(B, C)    1
Name: Facility, dtype: int64
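
If a tidy DataFrame is preferred over the Series above, the same pipeline converts readily; the column names 'Pair' and 'Shared products' below are just illustrative choices:

from itertools import combinations

out = (df
 .groupby('Product')['Facility']
 .agg(lambda x: list(combinations(sorted(set(x)), r=2)))
 .explode()
 .value_counts()
 .rename_axis('Pair')                       # name the index of pair tuples
 .reset_index(name='Shared products')      # turn the counts into a column
 )

which should give:

     Pair  Shared products
0  (A, B)                2
1  (A, C)                2
2  (B, C)                1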