Home > Net >  How to merge for loop output dataframes into one with python?
How to merge for loop output dataframes into one with python?

Time:03-15

I have 2 dataframes as following:

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

And my output is to generate a new dataframe with column B in dfb and another column of distance(e.g. Hamming distance from AC to AA is 1) between every element from B to A, like this:

   B    disB  disB disB
0  AC    1    2    1
1  BC    2    1    1
2  CC    2    2    0

The codes I have tried like this (courtesy of other posts):

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

df_summary = dfb.copy()

for seq1 in dfa.A:
    df__ = []
    for seq2 in dfb.B:
        hd = sum(c1 != c2 for c1, c2 in zip(seq1, seq2))
        df__.append(hd)

    df_summary['dis_{}'.format(column)] = pd.DataFrame({'dis_'   column: df__}).values
    print(df_summary)

The result will give me 3 outputs:

    B  dis_B
0  AC      1
1  BC      2
2  CC      2
    B  dis_B
0  AC      2
1  BC      1
2  CC      2
    B  dis_B
0  AC      1
1  BC      1
2  CC      0

but I need to combine them into one, like:

   B    disB  disB disB
0  AC    1    2    1
1  BC    2    1    1
2  CC    2    2    0

Thanks for your help!

CodePudding user response:

A vectorized (read "much faster") solution:

a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
b = np.array(dfb['B'].str.split('').str[1:-1].tolist())

dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)

Output:

>>> dfb
    B  disB_1  disB_2  disB_3
0  AC       1       2       1
1  BC       2       1       1
2  CC       2       2       0

CodePudding user response:

Here's an answer that gives a result in a slightly different form than the question frames things, but uses the values of 'A' and 'B' as the index and columns of the dataframe result, which may be more descriptive of the ultimate result:

import pandas as pd

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)

Output:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

Here is a performance comparison between the above approach creating a general matrix and a solution using numpy shown in another answer which uses hardcoded column names:

import pandas as pd
import numpy as np

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)


dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

def foo(dfa, dfb):
    df = pd.DataFrame(data=[[sum(c != d for c, d in zip(dfb['B'][i], dfa['A'][j])) for j in range(len(dfa['A']))] for i in range(len(dfb['B']))], index=dfb['B'], columns=dfa['A'])
    return df
    


def bar(dfa, dfb):
    a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
    b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
    dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
    return dfb

import timeit

print("\nGeneral matrix approach:")
t = timeit.timeit(lambda: foo(dfa, dfb), number = 100)
print(f"timeit: {t}")

print("\nHarcoded columns approach:")
t = timeit.timeit(lambda: bar(dfa, dfb), number = 100)
print(f"timeit: {t}")

Output and performance via timeit:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

General matrix approach:
timeit: 0.023536499997135252

Harcoded columns approach:
timeit: 0.03922149998834357

This seems to show that the numpy approach takes about 1.5-2x as long as the general matrix approach in this answer.

  • Related