I have a dataset like the one below: multiple groups, all values complete, with over 200 columns (each denoting a day).
Input
Series | 1 | 2 | 3 | 4 | 5 | 6 | 7 | GROUP |
---|---|---|---|---|---|---|---|---|
01/08/2021 | 100% | 75% | 60% | 50% | 40% | 30% | 0% | A |
08/08/2021 | 100% | 95% | 80% | 60% | 30% | 10% | 0% | A |
15/08/2021 | 100% | 85% | 60% | 40% | 20% | 10% | 5% | A |
01/08/2021 | 100% | 70% | 65% | 55% | 45% | 35% | 0% | B |
08/08/2021 | 100% | 90% | 80% | 60% | 30% | 10% | 0% | B |
15/08/2021 | 100% | 95% | 60% | 40% | 30% | 20% | 5% | B |
Now, I have an incomplete dataset like the one below. I would like to compute a similarity metric for each group and state which series is most similar.
As the similarity metric I am currently using CORREL in Excel, and in the case of a tie I pick the latest series. For comparison, only values that are complete in both series are used (i.e. the missing values in the expected output are excluded from the similarity calculation).
Currently this is a VBA macro, which I am porting to Python (either pandas or PySpark).
I am unsure how best to proceed. Any other similarity metric could work too. Thanks
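(For reference: Excel's CORREL is the Pearson correlation coefficient, so the target metric for, say, the first incomplete row against the first complete row of group A can be reproduced like this — values taken from the tables below:)

import numpy as np

# Overlapping non-missing values (days 5-7) of 01/09/2021 vs 01/08/2021, group A
complete = [0.40, 0.30, 0.00]
incomplete = [0.39, 0.28, 0.00]
print(np.corrcoef(complete, incomplete)[0, 1])  # ≈ 0.9994, same as Excel's CORREL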
Expected Output
Series | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Similarity_Score | Similarity_Week | Group |
---|---|---|---|---|---|---|---|---|---|---|
01/09/2021 | | | | | 39% | 28% | 0% | 0.99 | 01/08/2021 | A |
08/09/2021 | | | 62% | 44% | 21% | 12% | 7% | 0.99 | 15/08/2021 | A |
15/09/2021 | | | | | | 8% | 0% | 1.00 | 08/08/2021 | A |
15/09/2021 | | | | | 30% | 19% | 0% | 1.00 | 15/08/2021 | B |
CodePudding user response:
This solution iterates over each group, takes the subset of each dataframe for that group, and forms the Cartesian product of their rows, so that each incomplete row is compared against every complete row.
Some nested zip/filter/reversed trickery keeps only the columns that are filled out in both rows. Collecting each score in a list together with the dates from both dataframes and the group, we can build a dataframe of scores, sort it, group it, and keep the top score for each series.
Joining this back to the second dataframe gives the output you want.
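To make the zip/filter/reversed step in the code below concrete, here is what it does to one pair of rows (values taken from the first row of each frame; the trick assumes the incomplete row's NaNs are all leading):

import numpy as np

a = ['100%', '75%', '60%', '50%', '40%', '30%', '0%']     # complete row
b = [np.nan, np.nan, np.nan, np.nan, '39%', '28%', '0%']  # incomplete row
# NaN != NaN, so the filter drops b's missing cells; pairing the survivors
# with reversed(a) aligns them with the tail of a.
a_cut, b_cut = zip(*zip(reversed(a), filter(lambda v: v == v, b)))
print(list(a_cut))            # ['0%', '30%', '40%']  (days 7, 6, 5)
print(list(reversed(b_cut)))  # ['0%', '28%', '39%']  (days 7, 6, 5)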
import pandas as pd
import numpy as np
from itertools import product
df = pd.DataFrame({'Series': {0: '01/08/2021',
1: '08/08/2021',
2: '15/08/2021',
3: '01/08/2021',
4: '08/08/2021',
5: '15/08/2021'},
'1': {0: '100%', 1: '100%', 2: '100%', 3: '100%', 4: '100%', 5: '100%'},
'2': {0: '75%', 1: '95%', 2: '85%', 3: '70%', 4: '90%', 5: '95%'},
'3': {0: '60%', 1: '80%', 2: '60%', 3: '65%', 4: '80%', 5: '60%'},
'4': {0: '50%', 1: '60%', 2: '40%', 3: '55%', 4: '60%', 5: '40%'},
'5': {0: '40%', 1: '30%', 2: '20%', 3: '45%', 4: '30%', 5: '30%'},
'6': {0: '30%', 1: '10%', 2: '10%', 3: '35%', 4: '10%', 5: '20%'},
'7': {0: '0%', 1: '0%', 2: '5%', 3: '0%', 4: '0%', 5: '5%'},
'GROUP': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}})
df2 = pd.DataFrame({'Series': {0: '01/09/2021',
1: '08/09/2021',
2: '15/09/2021',
3: '15/09/2021'},
'1': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'3': {0: np.nan, 1: '62%', 2: np.nan, 3: np.nan},
'4': {0: np.nan, 1: '44%', 2: np.nan, 3: np.nan},
'5': {0: '39%', 1: '21%', 2: np.nan, 3: '30%'},
'6': {0: '28%', 1: '12%', 2: '8%', 3: '19%'},
'7': {0: '0%', 1: '7%', 2: '0%', 3: '0%'},
'Similarity_Score': {0: 0.99, 1: 0.99, 2: 1.0, 3: 1.0},
'Similarity_Week': {0: '01/08/2021',
1: '15/08/2021',
2: '08/08/2021',
3: '15/08/2021'},
'Group': {0: 'A', 1: 'A', 2: 'A', 3: 'B'}}
)
df2.drop(columns=['Similarity_Score','Similarity_Week'], inplace=True)
scores = []
for g, data in df.groupby('GROUP'):
    incompletes = df2.loc[df2['Group'] == g]
    for complete_row, incomplete_row in product(data.values, incompletes.values):
        a = complete_row[1:-1]    # day columns of the complete series
        b = incomplete_row[1:-1]  # day columns of the incomplete series
        # NaN != NaN, so the filter drops b's missing cells; pairing the
        # survivors with reversed(a) aligns them with the tail of a.
        a, b = zip(*zip(reversed(a), filter(lambda v: v == v, b)))
        a = [int(v.replace('%', '')) / 100 for v in a]
        # Reverse b back so both arrays are in the same column order.
        b = list(reversed([int(v.replace('%', '')) / 100 for v in b]))
        scores.append([g, complete_row[0], incomplete_row[0], np.corrcoef(a, b)[1, 0]])
best = (pd.DataFrame(scores, columns=['Group', 'Similarity_Week', 'Series', 'Similarity_Score'])
          .sort_values(by=['Similarity_Score', 'Similarity_Week'], ascending=False)
          .groupby(['Group', 'Series'])
          .head(1))  # best score per series; on ties the later week sorts first
out = df2.merge(best, on=['Group', 'Series'])
Output
Series 1 2 3 4 5 6 7 Group Similarity_Week \
0 01/09/2021 NaN NaN NaN NaN 39% 28% 0% A 01/08/2021
1 08/09/2021 NaN NaN 62% 44% 21% 12% 7% A 15/08/2021
2 15/09/2021 NaN NaN NaN NaN NaN 8% 0% A 01/08/2021
3 15/09/2021 NaN NaN NaN NaN 30% 19% 0% B 15/08/2021
Similarity_Score
0 0.999405
1 0.999005
2 1.000000
3 0.999286
The scores for 15/09/2021 in group A are so close that rounding them would change which (most recent) date wins the tie-break. You can verify this by checking

[x for x in scores if x[2] == '15/09/2021' and x[0] == 'A']

which yields

[['A', '01/08/2021', '15/09/2021', 1.0],
 ['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
 ['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]

So 15/08/2021 would be the chosen date if you rounded to a few decimal places, which you can do by wrapping round() around the np.corrcoef call.
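For example, rounding to 6 decimal places collapses the three near-ties above to 1.0, so the sort then picks the most recent week, 15/08/2021:

scores.append([g, complete_row[0], incomplete_row[0],
               round(np.corrcoef(a, b)[1, 0], 6)])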
CodePudding user response:
If you prefer a solution without `for` loops, you could `merge` the two data frames on `Group` and use `groupby` to apply the similarity metric.
Building on the data frames constructed by @Chris:
df.rename(columns={"GROUP":"Group"}, inplace=True)
def similarity(arr1, arr2):
"""Similarity between two arrays of percent strings, nans ignored"""
df = pd.DataFrame({"arr1":arr1, "arr2":arr2}).dropna() \
.apply(lambda s: s.str.strip("%").astype(float)/100)
return df.arr1.corr(df.arr2)
# Convert data columns to array in each row.
df_xformed = df.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
.reset_index().rename(columns={"Series":"df_Series", 0:"df"})
df2_xformed = df2.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
.reset_index().rename(columns={"Series":"df2_Series", 0:"df2"})
# Merge on Group and calculate similarities.
df_combined = df_xformed.merge(df2_xformed, on="Group")
df_combined["similarity"] = df_combined.apply(
lambda row: similarity(row["df"], row["df2"]), axis=1)
# Find max similarity of each df2_Series within its Group.
df_combined["df2_sim_max"] = df_combined.groupby(\
["df2_Series", "Group"])["similarity"] \
.transform(max)
idx = df_combined["similarity"] == df_combined["df2_sim_max"]
result = df_combined[idx][["df2_Series", "Group", "df2", "df_Series", "similarity"]]
result
# df2_Series Group ... df_Series similarity
# 0 01/09/2021 A ... 01/08/2021 0.999405
# 2 15/09/2021 A ... 01/08/2021 1.000000
# 7 08/09/2021 A ... 15/08/2021 0.999005
# 11 15/09/2021 B ... 15/08/2021 0.999286
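As a quick sanity check, the `similarity` helper can also be called directly on two percent-string arrays; on the overlapping values of 01/09/2021 and 01/08/2021 in group A it reproduces the score in the first row:

similarity(['40%', '30%', '0%'], ['39%', '28%', '0%'])
# ≈ 0.999405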