Home > Enterprise >  Concat two CSVs with different number of rows returning warning
Concat two CSVs with different number of rows returning warning

Time:03-09

I need to concat two CSVs with a different number of rows. I receive a warning when I do so and the results in print(df1.head()) return an empty dataframe when I know by visual inspection that there are matching rows.

Could the warning affect the results reflected in head()? Or is there another issue I need to resolve?

# columns to filter on
columns_to_match = ['column1', 'column2', 'column3', 'column4', 'column5']

# find matching rows based off columns in filter 
filter = df1[columns_to_match].eq(df2[columns_to_match]).all(axis=1)

# concat filtered results
df1 = pd.concat([df1[columns_to_match], df2[columns_to_match]])

# print matching results 
print(df1.head())

I receive the following error:

UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 = pd.concat([df1[filter], df2[filter]])

Thank you.

CodePudding user response:

Firstly, you have multiple uses for filter, which is confusing and error-prone. I believe that the warning arises because filter = df1[filter].eq(df2[filter]).all(axis=1) is a series aligned to df1. However you then try to subset df2 using the same mask series.

One way to fix it would be to have e.g. filter2 = df2[cols].eq(df1[cols]).all(axis=1) and then do df1 = pd.concat([df1[filter], df2[filter2]])

(note that I am defining cols = ['column1', 'column2', 'column3', 'column4', 'column5'] here.

Edit - this should work:

# find matching rows based off columns in filter 
filter1 = df1[columns_to_match].eq(df2[columns_to_match]).all(axis=1)
filter2 = df2[columns_to_match].eq(df1[columns_to_match]).all(axis=1)

# concat filtered results
df1 = pd.concat([df1[filter1], df2[filter2]])

CodePudding user response:

Thank you to @ags29 and @not_speshal. Both of your suggestions led me to this solution.

# columns to filter on
columns_to_match = ['column1', 'column2', 'column3', 'column4', 'column5']

# merge filtered results)
df1 = pd.merge(df1, df2, on=columns_to_match)

print(df1.head())
  • Related