These are the DataFrames in my list (the list is named clusters):
[ Height Weight
STU3 72.508120 216.218230
STU2 70.183550 201.071918
STU4 71.252986 204.655494,
Height Weight
STU18 64.756280 137.348471
STU11 63.075024 146.905558
STU16 63.981765 147.812869,
Height Weight
STU12 56.810317 84.170695,
Height Weight
STU1 65.270346 168.617746
STU6 65.806248 165.850648
STU7 68.096220 167.747141
STU9 66.166363 165.514607
STU10 67.906850 170.417231,
Height Weight
STU5 65.237050 181.011973
STU8 67.155963 175.646690
STU20 69.443615 178.276728,
Height Weight
STU18 64.756280 137.348471
STU11 63.075024 146.905558
STU16 63.981765 147.812869,
Height Weight
STU17 61.253579 109.681758
STU13 60.916196 120.943248
STU19 60.236390 123.863208
STU14 63.383506 125.662081
STU15 60.822118 127.441434]
This is the DataFrame I want to remove from the list above (referred to as biggest_cluster):
[ Height Weight
STU18 64.756280 137.348471
STU11 63.075024 146.905558
STU16 63.981765 147.812869]
So my code snippet to remove this dataframe is as follows:
clusters.remove(biggest_cluster)
The error I get is:
ValueError: Can only compare identically-labeled DataFrame objects
Normally, this error occurs when DataFrames are compared. Here, however, I'm only trying to remove an element (which happens to be a DataFrame) from a list of DataFrames.
How can I resolve this issue?
CodePudding user response:
The list.remove method will indeed compare DataFrames for equality (==). From the Python docs:
list.remove(x)
Remove the first item from the list whose value is equal to x. It raises a ValueError if there is no such item.
This already fails when comparing the first DataFrame in the list, because its row labels differ from those of biggest_cluster.
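The failure can be reproduced with two small frames. This is only a sketch: the variable names first and error_message are mine, and the values are a subset of the rows shown in the question.

```python
import pandas as pd

# Two frames with the same shape but different row labels.
first = pd.DataFrame(
    {"Height": [72.508120, 70.183550], "Weight": [216.218230, 201.071918]},
    index=["STU3", "STU2"],
)
biggest_cluster = pd.DataFrame(
    {"Height": [64.756280, 63.075024], "Weight": [137.348471, 146.905558]},
    index=["STU18", "STU11"],
)
clusters = [first, biggest_cluster]

try:
    # list.remove walks the list and evaluates `item == biggest_cluster`;
    # the comparison with `first` raises before the match is ever reached.
    clusters.remove(biggest_cluster)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The list is left unchanged, because the exception is raised on the very first comparison.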
There are many ways to work around this, all of which are a bit longer than .remove. A rather simple and readable one is a list comprehension that tests with DataFrame.equals instead of == and keeps every cluster that does not match:
clusters = [cluster for cluster in clusters if not cluster.equals(biggest_cluster)]
Unlike .remove, this drops every DataFrame that matches, not just the first one.
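One disadvantage: the comprehension builds a second list before the old one is released. Both lists hold references to the same DataFrames, so no data is copied, and the extra memory only matters if the list itself has a very large number of entries.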
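If rebuilding the list is a concern, or only the first match should go (as with .remove), one option is to locate the match and delete it in place. A minimal sketch follows; remove_frame is a hypothetical helper name, not a pandas or built-in function, and the sample frames are made up for illustration.

```python
import pandas as pd


def remove_frame(frames, target):
    """Delete the first frame in `frames` that equals `target`, in place.

    Hypothetical helper: returns True if a frame was removed, else False.
    """
    for position, frame in enumerate(frames):
        if frame.equals(target):
            del frames[position]
            return True
    return False


# Illustrative data only.
tall = pd.DataFrame({"Height": [72.5], "Weight": [216.2]}, index=["STU3"])
short = pd.DataFrame({"Height": [56.8], "Weight": [84.2]}, index=["STU12"])
clusters = [tall, short]

# A copy is used to show that the match is by content, not by object identity.
removed = remove_frame(clusters, short.copy())
print(removed, len(clusters))
```

Returning a flag instead of raising is a design choice for this sketch; raising ValueError when nothing matches would mirror list.remove more closely.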
In the context of pandas DataFrames, comparing with == means element-wise comparison, which doesn't work if the labels don't match, while pd.DataFrame.equals, which I used above, checks whether the whole DataFrame is the same.
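A short illustration of the difference, using made-up frames (the names left, same_labels and other_labels are assumptions for this sketch):

```python
import pandas as pd

left = pd.DataFrame({"Height": [64.7, 63.0]}, index=["STU18", "STU11"])
same_labels = pd.DataFrame({"Height": [64.7, 60.0]}, index=["STU18", "STU11"])
other_labels = pd.DataFrame({"Height": [64.7, 63.0]}, index=["STU1", "STU2"])

# == on identically labeled frames gives a DataFrame of booleans, one per cell.
elementwise = left == same_labels

# .equals gives a single answer for the whole frame and does not raise
# when the labels differ; it simply reports that the frames are not equal.
whole_same = left.equals(left.copy())
whole_diff = left.equals(other_labels)

print(elementwise)
print(whole_same, whole_diff)
```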
CodePudding user response:
Here is a possible solution that drops rows from a single DataFrame by index label, using:
df1.drop(df2.index.values, inplace=True)
import numpy as np
import pandas as pd

# A few sample values: one row per student, columns are Height and Weight.
my_matrix = np.array([
    [72.508120, 216.218230],
    [70.183550, 201.071918],
    [71.252986, 137.348471],
    [64.756280, 204.655494],
    [63.075024, 146.905558],
])
# Row labels are kept unique so that drop() removes exactly the intended rows.
index = ['STU1', 'STU2', 'STU3', 'STU4', 'STU5']
# Instanced as my_df
my_df = pd.DataFrame(data=my_matrix, index=index, columns=['Height', 'Weight'])
# Sample of what the big cluster could be
my_other_matrix = np.array([
    [64.756280, 204.655494],
    [63.075024, 146.905558],
])
# Now as a df
my_other_df = pd.DataFrame(data=my_other_matrix, index=['STU1', 'STU2'], columns=['Height', 'Weight'])
# Assume we wish to remove indices STU1 and STU2.
# Drop the rows of my_df whose labels appear in the big cluster's index.
my_df.drop(my_other_df.index.values, inplace=True)
print(my_df)
Note that inplace=True modifies my_df directly, so no reassignment is needed.
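For comparison, without inplace=True, drop leaves the original untouched and returns a new DataFrame. A small sketch with made-up values (heights and trimmed are illustrative names):

```python
import pandas as pd

heights = pd.DataFrame(
    {"Height": [72.5, 70.2, 71.3]},
    index=["STU1", "STU2", "STU3"],
)

# drop returns a new frame here; `heights` itself still has all three rows.
trimmed = heights.drop(index=["STU1", "STU2"])

print(trimmed)
print(len(heights))
```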