I have a dataset like the one below: multiple groups, all values complete, with over 200 columns (each denoting a day).
Input
Series | 1 | 2 | 3 | 4 | 5 | 6 | 7 | GROUP |
---|---|---|---|---|---|---|---|---|
01/08/2021 | 100% | 75% | 60% | 50% | 40% | 30% | 0% | A |
08/08/2021 | 100% | 95% | 80% | 60% | 30% | 10% | 0% | A |
15/08/2021 | 100% | 85% | 60% | 40% | 20% | 10% | 5% | A |
01/08/2021 | 100% | 70% | 65% | 55% | 45% | 35% | 0% | B |
08/08/2021 | 100% | 90% | 80% | 60% | 30% | 10% | 0% | B |
15/08/2021 | 100% | 95% | 60% | 40% | 30% | 20% | 5% | B |
Now, I have an incomplete dataset like the one below. I would like to compute a similarity metric for each group and state which series is most similar.
As the similarity metric I am currently using CORREL in Excel, and in the case of a tie I pick the latest series. For comparison, only values that are complete in both series are used (i.e. the missing values in the expected output are excluded from the similarity calculation).
Currently this is a VBA macro, which I am porting to Python (either pandas or PySpark).
I am unsure how best to proceed. Any other similarity metric could work too. Thanks
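(For reference: Excel's CORREL is the Pearson correlation coefficient, so the target metric for, say, the first incomplete row against the first complete row of group A can be reproduced like this — values taken from the tables below:)

import numpy as np

# Overlapping non-missing values (days 5-7) of 01/09/2021 vs 01/08/2021, group A
complete = [0.40, 0.30, 0.00]
incomplete = [0.39, 0.28, 0.00]
print(np.corrcoef(complete, incomplete)[0, 1])  # ≈ 0.9994, same as Excel's CORREL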
Expected Output
Series | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Similarity_Score | Similarity_Week | Group |
---|---|---|---|---|---|---|---|---|---|---|
01/09/2021 | | | | | 39% | 28% | 0% | 0.99 | 01/08/2021 | A |
08/09/2021 | | | 62% | 44% | 21% | 12% | 7% | 0.99 | 15/08/2021 | A |
15/09/2021 | | | | | | 8% | 0% | 1.00 | 08/08/2021 | A |
15/09/2021 | | | | | 30% | 19% | 0% | 1.00 | 15/08/2021 | B |
CodePudding user response:
This solution iterates over each group, takes the subset of each dataframe for that group, and forms the Cartesian product of their rows, so that each incomplete row is compared against every complete row.
Some nested zip/filter/reversed trickery keeps only the columns that are filled out in both rows. Collecting each score in a list together with the dates from both dataframes and the group, we can build a dataframe of scores, sort it, group it, and keep the top score for each series.
Joining this back to the second dataframe gives the output you want.
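To make the zip/filter/reversed step in the code below concrete, here is what it does to one pair of rows (values taken from the first row of each frame; the trick assumes the incomplete row's NaNs are all leading):

import numpy as np

a = ['100%', '75%', '60%', '50%', '40%', '30%', '0%']     # complete row
b = [np.nan, np.nan, np.nan, np.nan, '39%', '28%', '0%']  # incomplete row
# NaN != NaN, so the filter drops b's missing cells; pairing the survivors
# with reversed(a) aligns them with the tail of a.
a_cut, b_cut = zip(*zip(reversed(a), filter(lambda v: v == v, b)))
print(list(a_cut))            # ['0%', '30%', '40%']  (days 7, 6, 5)
print(list(reversed(b_cut)))  # ['0%', '28%', '39%']  (days 7, 6, 5)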
import pandas as pd
import numpy as np
from itertools import product
df = pd.DataFrame({'Series': {0: '01/08/2021',
1: '08/08/2021',
2: '15/08/2021',
3: '01/08/2021',
4: '08/08/2021',
5: '15/08/2021'},
'1': {0: '100%', 1: '100%', 2: '100%', 3: '100%', 4: '100%', 5: '100%'},
'2': {0: '75%', 1: '95%', 2: '85%', 3: '70%', 4: '90%', 5: '95%'},
'3': {0: '60%', 1: '80%', 2: '60%', 3: '65%', 4: '80%', 5: '60%'},
'4': {0: '50%', 1: '60%', 2: '40%', 3: '55%', 4: '60%', 5: '40%'},
'5': {0: '40%', 1: '30%', 2: '20%', 3: '45%', 4: '30%', 5: '30%'},
'6': {0: '30%', 1: '10%', 2: '10%', 3: '35%', 4: '10%', 5: '20%'},
'7': {0: '0%', 1: '0%', 2: '5%', 3: '0%', 4: '0%', 5: '5%'},
'GROUP': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}})
df2 = pd.DataFrame({'Series': {0: '01/09/2021',
1: '08/09/2021',
2: '15/09/2021',
3: '15/09/2021'},
'1': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'3': {0: np.nan, 1: '62%', 2: np.nan, 3: np.nan},
'4': {0: np.nan, 1: '44%', 2: np.nan, 3: np.nan},
'5': {0: '39%', 1: '21%', 2: np.nan, 3: '30%'},
'6': {0: '28%', 1: '12%', 2: '8%', 3: '19%'},
'7': {0: '0%', 1: '7%', 2: '0%', 3: '0%'},
'Similarity_Score': {0: 0.99, 1: 0.99, 2: 1.0, 3: 1.0},
'Similarity_Week': {0: '01/08/2021',
1: '15/08/2021',
2: '08/08/2021',
3: '15/08/2021'},
'Group': {0: 'A', 1: 'A', 2: 'A', 3: 'B'}}
)
df2.drop(columns=['Similarity_Score','Similarity_Week'], inplace=True)
scores = []
for g, data in df.groupby('GROUP'):
    incompletes = df2.loc[df2['Group'] == g]
    for complete_row, incomplete_row in product(data.values, incompletes.values):
        a = complete_row[1:-1]    # day columns of the complete series
        b = incomplete_row[1:-1]  # day columns of the incomplete series
        # NaN != NaN, so the filter drops b's missing cells; pairing the
        # survivors with reversed(a) aligns them with the tail of a.
        a, b = zip(*zip(reversed(a), filter(lambda v: v == v, b)))
        a = [int(v.replace('%', '')) / 100 for v in a]
        # Reverse b back so both arrays are in the same column order.
        b = list(reversed([int(v.replace('%', '')) / 100 for v in b]))
        scores.append([g, complete_row[0], incomplete_row[0], np.corrcoef(a, b)[1, 0]])
best = (pd.DataFrame(scores, columns=['Group', 'Similarity_Week', 'Series', 'Similarity_Score'])
          .sort_values(by=['Similarity_Score', 'Similarity_Week'], ascending=False)
          .groupby(['Group', 'Series'])
          .head(1))  # best score per series; on ties the later week sorts first
out = df2.merge(best, on=['Group', 'Series'])
Output
Series 1 2 3 4 5 6 7 Group Similarity_Week \
0 01/09/2021 NaN NaN NaN NaN 39% 28% 0% A 01/08/2021
1 08/09/2021 NaN NaN 62% 44% 21% 12% 7% A 15/08/2021
2 15/09/2021 NaN NaN NaN NaN NaN 8% 0% A 01/08/2021
3 15/09/2021 NaN NaN NaN NaN 30% 19% 0% B 15/08/2021
Similarity_Score
0 0.999405
1 0.999005
2 1.000000
3 0.999286
The scores for 15/09/2021 in group A are so close that rounding them would change which (most recent) date wins the tie-break. You can verify this by checking

[x for x in scores if x[2] == '15/09/2021' and x[0] == 'A']

which yields

[['A', '01/08/2021', '15/09/2021', 1.0],
 ['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
 ['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]

So 15/08/2021 would be the chosen date if you rounded to a few decimal places, which you can do by wrapping round() around the np.corrcoef call.
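For example, rounding to 6 decimal places collapses the three near-ties above to 1.0, so the sort then picks the most recent week, 15/08/2021:

scores.append([g, complete_row[0], incomplete_row[0],
               round(np.corrcoef(a, b)[1, 0], 6)])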
CodePudding user response:
If you prefer a solution without `for` loops, you could `merge` the two data frames on `Group` and use `groupby` to apply the similarity metric.
Building on the data frames constructed by @Chris:
df.rename(columns={"GROUP":"Group"}, inplace=True)
def similarity(arr1, arr2):
"""Similarity between two arrays of percent strings, nans ignored"""
df = pd.DataFrame({"arr1":arr1, "arr2":arr2}).dropna() \
.apply(lambda s: s.str.strip("%").astype(float)/100)
return df.arr1.corr(df.arr2)
# Convert data columns to array in each row.
df_xformed = df.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
.reset_index().rename(columns={"Series":"df_Series", 0:"df"})
df2_xformed = df2.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
.reset_index().rename(columns={"Series":"df2_Series", 0:"df2"})
# Merge on Group and calculate similarities.
df_combined = df_xformed.merge(df2_xformed, on="Group")
df_combined["similarity"] = df_combined.apply(
lambda row: similarity(row["df"], row["df2"]), axis=1)
# Find max similarity of each df2_Series within its Group.
df_combined["df2_sim_max"] = df_combined.groupby(\
["df2_Series", "Group"])["similarity"] \
.transform(max)
idx = df_combined["similarity"] == df_combined["df2_sim_max"]
result = df_combined[idx][["df2_Series", "Group", "df2", "df_Series", "similarity"]]
result
# df2_Series Group ... df_Series similarity
# 0 01/09/2021 A ... 01/08/2021 0.999405
# 2 15/09/2021 A ... 01/08/2021 1.000000
# 7 08/09/2021 A ... 15/08/2021 0.999005
# 11 15/09/2021 B ... 15/08/2021 0.999286
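As a quick sanity check, the `similarity` helper can also be called directly on two percent-string arrays; on the overlapping values of 01/09/2021 and 01/08/2021 in group A it reproduces the score in the first row:

similarity(['40%', '30%', '0%'], ['39%', '28%', '0%'])
# ≈ 0.999405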