ValueError after saving and loading pandas DataFrame to csv-CodePudding

I am trying to find whether a row exists in a DataFrame based on the values of all columns. I believe I found a solution, but I'm having problems after saving and loading the DataFrame into/from a .csv file.

In the following example, I iterate over each row of the DataFrame, and find the index corresponding to each row -- i.e. the row where all columns are identical to the row being queried).

NB: In my real code, I iterate over a smaller DataFrame and search for rows in a larger DataFrame. But the issue happens in both cases.

import pandas  as pd

df = pd.DataFrame([[1, 2], [3, 4]])         # Create data frame
df.to_csv(my_filename, index=False)         # Save to csv
df1 = pd.read_csv(my_filename)              # Load from csv

# Find original data in loaded data
for row_idx, this_row in df.iterrows():
    print(np.where((df  == this_row).all(axis=1)))    # This returns the correct index

for row_idx, this_row in df.iterrows():
    print(np.where((df1 == this_row).all(axis=1)))    # This returns an empty index, and a FutureWarning

The output is:

(array([0]),)
(array([1]),)
(array([], dtype=int64),)
(array([], dtype=int64),)
tmp.py:25: FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version.  Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right`

After some debugging, I found that the DataFrame loaded from csv is not identical to the original DataFrame:

# The DataFrames look identical, but comparing gives me a ValueError:
df
df1
df == df1

The output is:

   0  1
0  1  2
1  3  4

   0  1
0  1  2
1  3  4

Traceback (most recent call last):

  File "tmp.py", line 30, in <module>
    df == df1

  File "python3.9/site-packages/pandas/core/ops/common.py", line 69, in new_method
    return method(self, other)

  File "python3.9/site-packages/pandas/core/arraylike.py", line 32, in __eq__
    return self._cmp_method(other, operator.eq)

  File "python3.9/site-packages/pandas/core/frame.py", line 6851, in _cmp_method
    self, other = ops.align_method_FRAME(self, other, axis, flex=False, level=None)

  File "python3.9/site-packages/pandas/core/ops/__init__.py", line 288, in align_method_FRAME
    raise ValueError(

ValueError: Can only compare identically-labeled DataFrame objects

Note: This appears to be related to a similar question, but the proposed solution, namely specifying the index labels, did not solve my problem.

Thanks in advance.

CodePudding user response：

If you are iterating through a data frame I would recommend you to transform your df into a dictionary.

df_dict = df.to_dict('records')

It is much faster as this great article details.

Now you can enumerate through df_dict and match it to your desired data.

    target_values = {'col1': 'foo', 'col2': 'bar', ...}
    for i, row in enumerate(df_dict):
          if row == target_values:
                match_index = i

Maybe also a good idea would be to start by matching only one column and if it matches check if everything else is identical too.