Suppose I have the following dataframe where the first column is a facility and the second column is a product produced by the facility.
import numpy as np
import pandas as pd

d = {'Facility': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
     'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)
df
  Facility Product
0        A      P1
1        A      P2
2        A      P3
3        B      P1
4        B      P2
5        B      P4
6        C      P1
7        C      P3
8        C      P5
My goal is to figure out how many of the same products are produced by each pair of facilities. For instance, in the given example, Facility A and Facility B both produce P1 and P2, so I would like to get two as the result for that pair.
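To make the goal concrete: for a single pair of facilities, the desired count is just the size of the intersection of their product sets. A minimal sketch for the pair (A, B), using the df above (prods_a and prods_b are just illustrative names):
# products made by each facility, as sets
prods_a = set(df.loc[df['Facility'] == 'A', 'Product'])  # {'P1', 'P2', 'P3'}
prods_b = set(df.loc[df['Facility'] == 'B', 'Product'])  # {'P1', 'P2', 'P4'}
len(prods_a & prods_b)  # 2 -> P1 and P2 are shared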
I have written the following code snippet; however, its execution is pretty slow. I was wondering if someone could show me a better way to achieve this.
# get all the unique products
products = df['Product'].unique()
# create the pair dataframe
colNames = ['Fac1', 'Fac2', 'Count']
linkedFacilities = pd.DataFrame(columns=colNames)
for p in products:
    # get the sub-dataframe for this product (an exact match;
    # str.contains(p) would also match substrings such as 'P10')
    myDf = df[df['Product'] == p]
    # every ordered pair of distinct facilities producing this product
    myList = [(f1, f2)
              for f1 in myDf.Facility
              for f2 in myDf.Facility if f1 != f2]
    temp = pd.DataFrame(data=myList, columns=colNames[:2])
    # each shared product between two facilities counts as one
    temp['Count'] = 1
    linkedFacilities = pd.concat([linkedFacilities, temp], ignore_index=True)
Then, I accumulate the counts for each pair and eliminate the mirrored duplicates.
grouped = linkedFacilities.groupby(["Fac1", "Fac2"]).size().reset_index()
grouped = grouped.rename(columns={0: "Count"})
# drop mirrored pairs such as (B, A) when (A, B) is already present
m = pd.DataFrame(np.sort(grouped[['Fac1', 'Fac2']], axis=1), index=grouped.index).duplicated()
grouped = grouped[~m]
grouped
  Fac1 Fac2  Count
0    A    B      2
1    A    C      2
3    B    C      1
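A note on the loop itself: calling pd.concat inside the loop copies the accumulated frame on every iteration, so collecting the per-product frames in a list and concatenating once is a common incremental speedup. A sketch of just that change, reusing df and products from above:
# same pair generation, but concatenate once at the end instead of
# growing linkedFacilities inside the loop
frames = []
for p in products:
    sub = df[df['Product'] == p]
    pairs = [(f1, f2)
             for f1 in sub.Facility
             for f2 in sub.Facility if f1 != f2]
    frames.append(pd.DataFrame(pairs, columns=['Fac1', 'Fac2']))
linkedFacilities = pd.concat(frames, ignore_index=True)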
CodePudding user response:
Here's one option using merge and value_counts. Then remove mirrored pairs by mapping the index to frozensets and filtering out duplicates using duplicated:
counts = (df.merge(df, on='Product', suffixes=('_1', '_2'))
            .query('Facility_1 != Facility_2')
            .value_counts(['Facility_1', 'Facility_2']))
counts = (counts.reset_index(name='Number of shared products')
                .loc[pd.Series(counts.index.map(frozenset)).duplicated(keep='last')]
                .reset_index(drop=True))
Output:
  Facility_1 Facility_2  Number of shared products
0          A          B                          2
1          A          C                          2
2          B          C                          1
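For intuition about why the self-merge works: merging df with itself on Product yields one row per ordered pair of distinct facilities that share a product, so counting rows per pair gives the number of shared products. A small sketch of the intermediate frame (pairs is an illustrative name):
# one row per (product, ordered facility pair)
pairs = (df.merge(df, on='Product', suffixes=('_1', '_2'))
           .query('Facility_1 != Facility_2'))
print(pairs[['Product', 'Facility_1', 'Facility_2']].head(4))
#   Product Facility_1 Facility_2
# 1      P1          A          B
# 2      P1          A          C
# 4      P2          A          B
# 6      P3          A          C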
CodePudding user response:
Here is an approach using a grouped aggregation:
from itertools import combinations

(df
 .groupby('Product')['Facility']
 .agg(lambda x: list(map(frozenset, combinations(set(x), r=2))))
 .explode()
 .value_counts()
)
Output:
(A, C) 2
(A, B) 2
(C, B) 1
Name: Facility, dtype: int64
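The frozenset keys are what make mirrored pairs collapse into a single count, since frozensets compare equal regardless of element order; a quick check:
frozenset(('A', 'B')) == frozenset(('B', 'A'))  # True -> (A, B) and (B, A) are tallied together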
Alternatively, for a consistent ordering of the facilities:
from itertools import combinations

out = (df
       .groupby('Product')['Facility']
       .agg(lambda x: list(combinations(sorted(set(x)), r=2)))
       .explode()
       .value_counts()
)
out
Output:
(A, B) 2
(A, C) 2
(B, C) 1
Name: Facility, dtype: int64
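If a plain two-column frame is preferred over the tuple-indexed Series, the result is easy to unpack; a minimal sketch, assuming out from the variant above (result is an illustrative name):
# unpack the (Fac1, Fac2) tuple index into columns and attach the counts
result = pd.DataFrame(out.index.tolist(), columns=['Fac1', 'Fac2'])
result['Count'] = out.to_numpy()
print(result)
#   Fac1 Fac2  Count
# 0    A    B      2
# 1    A    C      2
# 2    B    C      1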