I have 2 dataframes as following:
dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))
And my output is to generate a new dataframe with column B in dfb and another column of distance(e.g. Hamming distance from AC to AA is 1) between every element from B to A, like this:
B disB disB disB
0 AC 1 2 1
1 BC 2 1 1
2 CC 2 2 0
The codes I have tried like this (courtesy of other posts):
dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))
df_summary = dfb.copy()
for seq1 in dfa.A:
df__ = []
for seq2 in dfb.B:
hd = sum(c1 != c2 for c1, c2 in zip(seq1, seq2))
df__.append(hd)
df_summary['dis_{}'.format(column)] = pd.DataFrame({'dis_' column: df__}).values
print(df_summary)
The result will give me 3 outputs:
B dis_B
0 AC 1
1 BC 2
2 CC 2
B dis_B
0 AC 2
1 BC 1
2 CC 2
B dis_B
0 AC 1
1 BC 1
2 CC 0
but I need to combine them into one, like:
B disB disB disB
0 AC 1 2 1
1 BC 2 1 1
2 CC 2 2 0
Thanks for your help!
CodePudding user response:
A vectorized (read "much faster") solution:
a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
Output:
>>> dfb
B disB_1 disB_2 disB_3
0 AC 1 2 1
1 BC 2 1 1
2 CC 2 2 0
CodePudding user response:
Here's an answer that gives a result in a slightly different form than the question frames things, but uses the values of 'A' and 'B' as the index
and columns
of the dataframe result, which may be more descriptive of the ultimate result:
import pandas as pd
lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)
Output:
AA BB CC
AC 1 2 1
BC 2 1 1
CC 2 2 0
Here is a performance comparison between the above approach creating a general matrix and a solution using numpy
shown in another answer which uses hardcoded column names:
import pandas as pd
import numpy as np
lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)
dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))
def foo(dfa, dfb):
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(dfb['B'][i], dfa['A'][j])) for j in range(len(dfa['A']))] for i in range(len(dfb['B']))], index=dfb['B'], columns=dfa['A'])
return df
def bar(dfa, dfb):
a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
return dfb
import timeit
print("\nGeneral matrix approach:")
t = timeit.timeit(lambda: foo(dfa, dfb), number = 100)
print(f"timeit: {t}")
print("\nHarcoded columns approach:")
t = timeit.timeit(lambda: bar(dfa, dfb), number = 100)
print(f"timeit: {t}")
Output and performance via timeit
:
AA BB CC
AC 1 2 1
BC 2 1 1
CC 2 2 0
General matrix approach:
timeit: 0.023536499997135252
Harcoded columns approach:
timeit: 0.03922149998834357
This seems to show that the numpy
approach takes about 1.5-2x as long as the general matrix approach in this answer.