"Can only compare identically-labeled DataFrame objects" error even though DataFrame objec-CodePudding

These are the DataFrames in the list (the list is named clusters):

[         Height      Weight
STU3  72.508120  216.218230
STU2  70.183550  201.071918
STU4  71.252986  204.655494,
           Height      Weight
STU18  64.756280  137.348471
STU11  63.075024  146.905558
STU16  63.981765  147.812869,
           Height     Weight
STU12  56.810317  84.170695,
           Height      Weight
STU1   65.270346  168.617746
STU6   65.806248  165.850648
STU7   68.096220  167.747141
STU9   66.166363  165.514607
STU10  67.906850  170.417231,
           Height      Weight
STU5   65.237050  181.011973
STU8   67.155963  175.646690
STU20  69.443615  178.276728,
           Height      Weight
STU18  64.756280  137.348471
STU11  63.075024  146.905558
STU16  63.981765  147.812869,
           Height      Weight
STU17  61.253579  109.681758
STU13  60.916196  120.943248
STU19  60.236390  123.863208
STU14  63.383506  125.662081
STU15  60.822118  127.441434]

This is the dataframe I want to remove from the array above (referred to as biggest_cluster):

[         Height      Weight
STU18  64.756280  137.348471
STU11  63.075024  146.905558
STU16  63.981765  147.812869]

So my code snippet to remove this dataframe is as follows:

clusters.remove(biggest_cluster)

The error I get is: ValueError: Can only compare identically-labeled DataFrame objects

Normally, this error occurs when dataframes are compared. However, in this case, I'm just trying to remove an element (which in this case happens to be a dataframe) from an array (In this case the array stores dataframes).

How can I resolve this issue?

CodePudding user response：

The list.remove method will indeed compare DataFrames for equality (==). From the Python docs (emphasis added):

list.remove(x)

Remove the first item from the list whose value is equal to x. It raises a ValueError if there is no such item.

This already fails when comparing the first DataFrame in the list because the row labels are different from biggest_cluster.
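
The failure can be reproduced with two small frames. The sketch below uses made-up values and hypothetical names (first, target) rather than the data from the question, and the exact wording of the message may differ between pandas versions:

```python
import pandas as pd

# Illustrative frames with different row labels (values are made up).
first = pd.DataFrame(
    {"Height": [72.5, 70.2], "Weight": [216.2, 201.1]},
    index=["STU3", "STU2"],
)
target = pd.DataFrame(
    {"Height": [64.8, 63.1], "Weight": [137.3, 146.9]},
    index=["STU18", "STU11"],
)

# The list holds a copy of target, so the identity shortcut in
# list.remove does not apply and == is evaluated on the first item.
clusters = [first, target.copy()]

try:
    clusters.remove(target)  # evaluates first == target internally
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because the exception is raised on the very first comparison, the list is left unchanged.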

There are several ways to work around this, all of which are a bit longer than .remove. A simple and readable option is a list comprehension that compares with DataFrame.equals instead of ==, keeping every cluster that is not equal to biggest_cluster:

clusters = [cluster for cluster in clusters if not cluster.equals(biggest_cluster)]

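One disadvantage: this builds a new list, so the old and the new list coexist until the old one is garbage-collected. Only references to the DataFrames are copied, not the data itself, so this should matter only for very long lists. Also note that, unlike .remove, the comprehension drops every DataFrame equal to biggest_cluster, not just the first one.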
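
To remove just the first match without building a second list, one option is to find its position and delete it in place. This is a minimal sketch, not part of the original answer; remove_first_equal is a hypothetical helper name and the sample frames are made up:

```python
import pandas as pd


def remove_first_equal(frames, target):
    """Delete the first DataFrame in `frames` that equals `target`, in place.

    Returns True if something was removed, False otherwise.
    """
    for position, frame in enumerate(frames):
        if frame.equals(target):
            del frames[position]
            return True
    return False


# Illustrative data: the same cluster appears twice in the list.
a = pd.DataFrame({"Height": [72.5], "Weight": [216.2]}, index=["STU3"])
b = pd.DataFrame({"Height": [64.8], "Weight": [137.3]}, index=["STU18"])
clusters = [a, b, b.copy()]

removed = remove_first_equal(clusters, b)
print(removed, len(clusters))
```

Returning as soon as the match is deleted avoids mutating the list while continuing to iterate over it.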

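For pandas DataFrames, == is an element-wise comparison: it raises the ValueError above when the labels don't match, and otherwise returns a DataFrame of booleans rather than a single True or False. pd.DataFrame.equals, which I used above, instead returns one boolean that says whether the two DataFrames have the same shape, labels, and elements.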
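
A minimal sketch of that difference, using illustrative frames (left, same, and other are hypothetical names, not from the question):

```python
import pandas as pd

left = pd.DataFrame({"Height": [64.8, 63.1]}, index=["STU18", "STU11"])
same = left.copy()
# Same values as `left`, but different row labels.
other = pd.DataFrame({"Height": [64.8, 63.1]}, index=["STU1", "STU2"])

# == on identically labeled frames yields a DataFrame of booleans.
elementwise = left == same

# equals() yields a single boolean and does not raise on label mismatch.
whole_same = left.equals(same)
whole_other = left.equals(other)

print(type(elementwise).__name__, whole_same, whole_other)
```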

CodePudding user response：

Here is a possible solution using:

df1.drop(df2.index.values,inplace=True)

import numpy as np
import pandas as pd

# A few sample rows: one (Height, Weight) pair per student.
my_matrix = np.array([
    [72.508120, 216.218230],
    [70.183550, 201.071918],
    [71.252986, 137.348471],
    [64.756280, 204.655494],
    [63.075024, 146.905558],
])
index = ['STU1', 'STU2', 'STU3', 'STU4', 'STU5']

# Build the main DataFrame, my_df.
my_df = pd.DataFrame(data=my_matrix, index=index, columns=['Height', 'Weight'])

# Sample of what the big cluster could be.
my_other_matrix = np.array([
    [64.756280, 204.655494],
    [63.075024, 146.905558],
])

# Now as a DataFrame.
my_other_df = pd.DataFrame(data=my_other_matrix, index=['STU1', 'STU2'], columns=['Height', 'Weight'])

# Assume we wish to remove the rows labeled STU1 and STU2.
# drop matches on index labels (not on cell values), so every row of my_df
# whose label also appears in the big cluster's index is removed.
my_df.drop(my_other_df.index.values, inplace=True)

print(my_df)

Note that inplace=True modifies my_df directly, so the result does not need to be assigned back to a variable.