I am loading three.csv files into three different Pandas data frames. The number of rows in each file should be the same, but it is not. Sometimes one file contains four to five additional rows compared to another. Sometimes two files include two to three extra rows compared to one file. I'm concatenating all files row by row, so I'm removing these additional rows to make them the same length. I wrote the following code to accomplish this.
df_ch = pd.read_csv("./file1.csv")
df_wr = pd.read_csv("./file2.csv")
df_an = pd.read_csv("./file3.csv")
# here df_wr have less number of rows than df_ch and df_an, so dropping rows from other two frames (df_ch and df_an)
df_ch.drop(df_ch.index[274299:274301], inplace=True)
df_an.drop(df_an.index[274299], inplace=True)
I did it manually in the above code, and now I want to do it automatically. One method is to use if-else to check the length of all frames and make it equal to the shortest length frame. But I'm curious whether Pandas has a faster technique to compare these frames and delete the additional rows.
CodePudding user response:
Couple of way you can do:
In general,
#take the intersection of all the index:
common = df1.index.intersection(df2.index).intersection(df3.index)
pd.concat([df1, df2, df3], axis=1).reindex(common)
Or in your case, maybe just
max_rows = min(map(len, [df1,df2,df3]))
pd.concat([df1, df2, df3], axis=1).iloc[:max_rows]
Update: best option should be join
:
df1.join(df2, how='inner').join(df3, how='inner')