Add columns to pandas dataframe from two separate dataframes with condition-CodePudding

I'll admit that this question is quite specific. I'm trying to write a function that reads two time columns (same label) in separate dataframes df1['gps'] and df2['gps']. I want to look for elements in the first column which are close to those in the second column, not necessarily in same row. When the condition on time distance is met, I want to save the close elements in df1['gps'] and df1['gps'] in a new dataframe called coinc in separate columns coinc['gps1'] and coinc['gps2'] in the fastest and most efficient way. This is my code:

def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    index_boolean = False
    if df2 is None:
        df2 = df1.copy()

    coincs = pd.DataFrame()
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps shift-df2.gps)<tdelta]
        print(r1.gps)
        coincs_single = pd.DataFrame()
        if len(ctrig)>0:
            coincs_single['gps1'] = r1.gps
            coincs_single['gps2'] = ctrig.gps                               
            coincs = pd.concat((coincs, coincs_single), axis = 0, ignore_index=index_boolean)                
            index_boolean=True
        else:
            pass
    return coincs

The script runs fine, but when investigating the output, I find that one column of coinc is all NaN and I don't understand why. Test case with generated data:

a = pd.DataFrame()    #define dataframes and fill them
b = pd.DataFrame()
a['gps'] = [0.12, 0.13, 0.6, 0.7]
b['gps'] = [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]

find_coinc(a, b, 0.16, 0)

The output yielded is:

    gps1 gps2
0   NaN 0.10
1   NaN 0.10
2   NaN 0.50
3   NaN 0.81
4   NaN 0.82
5   NaN 0.83

How can I write coinc so that both columns turn out fine?

CodePudding user response：

Well, here is another solution. Instead of concat two dataframes just add new rows to 'coincs' DataFrame. I will show you below.

def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    if df2 is None:
        df2 = df1.copy()

    coincs = pd.DataFrame(columns=['gps1', 'gps2'])
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps shift-df2.gps) < tdelta]
        if len(ctrig)>0:
            for ctrig_value in ctrig['gps']:
                # Add n rows based on 'ctrig' length.
                coincs.loc[len(coincs)] = [r1.gps, ctrig_value]
        else:
            pass
    return coincs

# -------------------

a = pd.DataFrame()  # define dataframes and fill them
b = pd.DataFrame()
a['gps'] = [0.12, 0.13, 0.6, 0.7]
b['gps'] = [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]

coins = find_coinc(a, b, 0.16, 0)

print('\n\n')
print(coins.to_string())

Result:

   gps1  gps2
0  0.12  0.10
1  0.13  0.10
2  0.60  0.50
3  0.70  0.81
4  0.70  0.82
5  0.70  0.83

I hope I could help! :D

CodePudding user response：

So the issue is that there are multiple elements in df2['gps'] which satisfy the condition of being within a time window of df1['gps']. I think I found a solution, but looking for a better one if possible. Highlighting the modified line in the original function as ### FIX UPDATE comment:

def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    index_boolean = False
    if df2 is None:
        df2 = df1.copy()

    coincs = pd.DataFrame()
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps shift-df2.gps)<tdelta]
        ctrig.reset_index(drop=True, inplace=True)
        coincs_single = pd.DataFrame()
        if len(ctrig)>0:
            coincs_single['gps1'] = [r1.gps]*len(ctrig)  ### FIX UPDATE
            coincs_single['gps2'] = ctrig.gps 
            print(ctrig.gps)
            coincs = pd.concat((coincs, coincs_single), axis = 0, ignore_index=index_boolean)
            index_boolean=True
        else:
            pass
    return coincs

The solution I chose, since I want to have all the instances of the condition being met, was to write the same element in df1['gps'] into coinc['gps1'] the needed amount of times.