Match multiple csv columns entries to another csv and extract data in python-CodePudding

I am trying to match csv entries and extract data but stuck. My csv files are in this format:

df1 looks like this:

type prediction ax ay az

df2 looks like this:

type ax ay az x y z fx fy fz

I would like to first match df1 and df2. For this, I need to match ax, ay and az all together with `df2. Matching only single column can give me wrong dataframe because entries are repeated.

After matching multiple columns with df2, I would like to extract those values and make a dataframe with df1.

Expected dataframe:

type prediction ax ay az x y z

df1 and df2 doesn't have same size. Actually, df2 is a huge file that is why I want to extract only required dataset.

This is my code:

def match_dataset(df1, df2):
    df1_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
    df2_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
    for i in range(len(df1)):
        for j in range(len(df2)):
            if df1.iloc[i]['az'] == df2.iloc[j]['az'] and df1.iloc[i]['ay'] == df2.iloc[j]['ay'] and df1.iloc[i]['ax'] == df2.iloc[j]['ax']:
                df1_new = df1_new.append(df2.iloc[j], ignore_index=True)
                #df2_new = df2_new.append(df2.iloc[j], ignore_index=True)
    return df1_new

data = match_dataset(df1, df2)

print(data.head())

But my code is stuck in loop. It doesn't give me output.

Can I get some help? Thank you.

CodePudding user response：

I think you can use df1.merge() to get the desired output. Here, I'm creating two dfs with your columns and some random digits in each column:

import pandas as pd
import numpy as np

df1_cols = ['type', 'prediction', 'ax', 'ay', 'az']
df2_cols = ['type', 'ax', 'ay', 'az', 'x', 'y', 'z', 'fx', 'fy', 'fz']

df1 = pd.DataFrame(np.random.randint(10,size=(4,len(df1_cols))), columns = df1_cols)
df2 = pd.DataFrame(np.random.randint(10,size=(1000,len(df2_cols))), columns = df2_cols)

In this example, df1 came out like this:

   type  prediction  ax  ay  az
0     8           1   8   2   7
1     3           0   5   4   5
2     7           3   0   0   2
3     2           5   3   5   7

Now apply merge:

df1_new = df1.merge(df2, on=['az','ay','ax'], how='left')
print(df1_new)

result:

   type_x  prediction  ax  ay  az  type_y    x    y    z   fx   fy   fz
0       8           1   8   2   7     3.0  0.0  2.0  6.0  7.0  8.0  9.0
1       3           0   5   4   5     NaN  NaN  NaN  NaN  NaN  NaN  NaN
2       7           3   0   0   2     NaN  NaN  NaN  NaN  NaN  NaN  NaN
3       2           5   3   5   7     9.0  8.0  4.0  0.0  3.0  3.0  1.0
4       2           5   3   5   7     9.0  9.0  1.0  3.0  6.0  5.0  9.0

Apparently, in this random example, we found 2 matches for df1.iloc[3], but zero for df1.iloc[1:3], hence the NA values. We can simply drop these from the df with df1_new.dropna(inplace=True). Finally, reset the index: df1_new.reset_index(drop=True, inplace=True):

   type_x  prediction  ax  ay  az  type_y    x    y    z   fx   fy   fz
0       8           1   8   2   7     3.0  0.0  2.0  6.0  7.0  8.0  9.0
1       2           5   3   5   7     9.0  8.0  4.0  0.0  3.0  3.0  1.0
2       2           5   3   5   7     9.0  9.0  1.0  3.0  6.0  5.0  9.0

To select only the columns mentioned in your def, you can use:

df1_new[['x','y','z','fx','fy','fz','az','ax','ay']]