I am trying to match csv entries and extract data but stuck. My csv files are in this format:
df1 looks like this:
type prediction ax ay az
df2 looks like this:
type ax ay az x y z fx fy fz
I would like to first match df1
and df2
. For this, I need to match ax
, ay
and az
all together with `df2. Matching only single column can give me wrong dataframe because entries are repeated.
After matching multiple columns with df2, I would like to extract those values and make a dataframe with df1.
Expected dataframe:
type prediction ax ay az x y z
df1 and df2 doesn't have same size. Actually, df2 is a huge file that is why I want to extract only required dataset.
This is my code:
def match_dataset(df1, df2):
df1_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
df2_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
for i in range(len(df1)):
for j in range(len(df2)):
if df1.iloc[i]['az'] == df2.iloc[j]['az'] and df1.iloc[i]['ay'] == df2.iloc[j]['ay'] and df1.iloc[i]['ax'] == df2.iloc[j]['ax']:
df1_new = df1_new.append(df2.iloc[j], ignore_index=True)
#df2_new = df2_new.append(df2.iloc[j], ignore_index=True)
return df1_new
data = match_dataset(df1, df2)
print(data.head())
But my code is stuck in loop. It doesn't give me output.
Can I get some help? Thank you.
CodePudding user response:
I think you can use df1.merge()
to get the desired output. Here, I'm creating two dfs with your columns and some random digits in each column:
import pandas as pd
import numpy as np
df1_cols = ['type', 'prediction', 'ax', 'ay', 'az']
df2_cols = ['type', 'ax', 'ay', 'az', 'x', 'y', 'z', 'fx', 'fy', 'fz']
df1 = pd.DataFrame(np.random.randint(10,size=(4,len(df1_cols))), columns = df1_cols)
df2 = pd.DataFrame(np.random.randint(10,size=(1000,len(df2_cols))), columns = df2_cols)
In this example, df1 came out like this:
type prediction ax ay az
0 8 1 8 2 7
1 3 0 5 4 5
2 7 3 0 0 2
3 2 5 3 5 7
Now apply merge:
df1_new = df1.merge(df2, on=['az','ay','ax'], how='left')
print(df1_new)
result:
type_x prediction ax ay az type_y x y z fx fy fz
0 8 1 8 2 7 3.0 0.0 2.0 6.0 7.0 8.0 9.0
1 3 0 5 4 5 NaN NaN NaN NaN NaN NaN NaN
2 7 3 0 0 2 NaN NaN NaN NaN NaN NaN NaN
3 2 5 3 5 7 9.0 8.0 4.0 0.0 3.0 3.0 1.0
4 2 5 3 5 7 9.0 9.0 1.0 3.0 6.0 5.0 9.0
Apparently, in this random example, we found 2 matches for df1.iloc[3]
, but zero for df1.iloc[1:3]
, hence the NA values. We can simply drop these from the df with df1_new.dropna(inplace=True)
. Finally, reset the index: df1_new.reset_index(drop=True, inplace=True)
:
type_x prediction ax ay az type_y x y z fx fy fz
0 8 1 8 2 7 3.0 0.0 2.0 6.0 7.0 8.0 9.0
1 2 5 3 5 7 9.0 8.0 4.0 0.0 3.0 3.0 1.0
2 2 5 3 5 7 9.0 9.0 1.0 3.0 6.0 5.0 9.0
To select only the columns mentioned in your def
, you can use:
df1_new[['x','y','z','fx','fy','fz','az','ax','ay']]