I am trying to find whether a row exists in a DataFrame based on the values of all columns. I believe I found a solution, but I'm having problems after saving and loading the DataFrame into/from a .csv file.
In the following example, I iterate over each row of the DataFrame and find the index corresponding to that row (i.e. the row where all columns are identical to the row being queried).
NB: In my real code, I iterate over a smaller DataFrame and search for rows in a larger DataFrame. But the issue happens in both cases.
import numpy as np
import pandas as pd
my_filename = "tmp.csv"  # placeholder path; any writable location works
df = pd.DataFrame([[1, 2], [3, 4]])  # Create data frame
df.to_csv(my_filename, index=False)  # Save to csv
df1 = pd.read_csv(my_filename)  # Load from csv
# Find original data in loaded data
for row_idx, this_row in df.iterrows():
    print(np.where((df == this_row).all(axis=1)))  # This returns the correct index
for row_idx, this_row in df.iterrows():
    print(np.where((df1 == this_row).all(axis=1)))  # This returns an empty index, and a FutureWarning
The output is:
(array([0]),)
(array([1]),)
(array([], dtype=int64),)
(array([], dtype=int64),)
tmp.py:25: FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right`
After some debugging, I found that the DataFrame loaded from csv is not identical to the original DataFrame:
# The DataFrames look identical, but comparing gives me a ValueError:
df
df1
df == df1
The output is:
   0  1
0  1  2
1  3  4
   0  1
0  1  2
1  3  4
Traceback (most recent call last):
  File "tmp.py", line 30, in <module>
    df == df1
  File "python3.9/site-packages/pandas/core/ops/common.py", line 69, in new_method
    return method(self, other)
  File "python3.9/site-packages/pandas/core/arraylike.py", line 32, in __eq__
    return self._cmp_method(other, operator.eq)
  File "python3.9/site-packages/pandas/core/frame.py", line 6851, in _cmp_method
    self, other = ops.align_method_FRAME(self, other, axis, flex=False, level=None)
  File "python3.9/site-packages/pandas/core/ops/__init__.py", line 288, in align_method_FRAME
    raise ValueError(
ValueError: Can only compare identically-labeled DataFrame objects
- Note: This appears to be related to a similar question, but the proposed solution, namely specifying the index labels, did not solve my problem.
Thanks in advance.
Answer:
If you are iterating over a DataFrame, I would recommend converting it to a list of dictionaries (one per row) first.
df_dict = df.to_dict('records')
Iterating over these records is much faster than using iterrows(), as this great article details.
Now you can enumerate through df_dict and compare each record to your desired data.
target_values = {'col1': 'foo', 'col2': 'bar'}  # ... one entry per column
for i, row in enumerate(df_dict):
    if row == target_values:
        match_index = i
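Applied to the data from the question, this could look roughly like the sketch below. One thing to watch out for: read_csv parses the header row as strings, so the reloaded frame has column labels "0" and "1" while the original has the integers 0 and 1, and the record dictionaries would never compare equal as they are. The sketch assumes that casting both sets of labels to str is acceptable; the in-memory buffer and the helper name find_row are only illustrative stand-ins, not part of the original code.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])

# Round-trip through CSV text in memory (stands in for the file in the question).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df1 = pd.read_csv(buffer)

# The column labels differ after the round trip: integers vs. strings.
labels_before = list(df.columns)
labels_after = list(df1.columns)

# Assumption: casting all labels to str is fine for the purpose of comparing rows.
records = df.rename(columns=str).to_dict("records")
records1 = df1.rename(columns=str).to_dict("records")


def find_row(target, rows):
    """Return the position of the first record equal to target, or None."""
    for i, row in enumerate(rows):
        if row == target:
            return i
    return None


# Look up every original row in the reloaded data.
matches = [find_row(target, records1) for target in records]
print(labels_before, labels_after, matches)
```

Whether to normalise the labels to strings or to convert the loaded labels back to integers is a design choice; either should work as long as both frames end up with identical labels.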
It might also be a good idea to start by matching on a single column and, only where that matches, check whether the remaining columns are identical too.
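A minimal sketch of that two-step idea, using made-up data (the frame big, its columns a and b, and the target values are hypothetical): first narrow the frame down with a single-column comparison, then compare full records only for the remaining candidates.

```python
import pandas as pd

# Hypothetical data: a larger frame and one row to look for.
big = pd.DataFrame({"a": [1, 3, 1], "b": [2, 4, 5]})
target = {"a": 1, "b": 2}

# Step 1: cheap filter on one column only.
first_col = next(iter(target))
candidates = big[big[first_col] == target[first_col]]

# Step 2: full comparison, but only for rows that survived step 1.
match_positions = [
    idx
    for idx, row in zip(candidates.index, candidates.to_dict("records"))
    if row == target
]
print(match_positions)
```

How much this helps presumably depends on how selective the first column is; if most rows share the same value there, the pre-filter removes very little.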