How do I drop columns in a pandas dataframe that exist in another dataframe?

Time:05-05

How do I drop columns from raw_clin if the same columns already exist in raw_clinical_sample? Using isin raised the error "cannot compute isin with a duplicate axis".

Explanation of the code:

I want to merge raw_clinical_patient and raw_clinical_sample dataframes. However, the SAMPLE_ID column in raw_clinical_sample should be relabeled as PATIENT_ID before the merge (because it was wrongly labelled). I want the new PATIENT_ID to be the index of raw_clin.

    import pandas as pd

    # Clinical patient info
    raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
    raw_clinical_patient["PATIENT_ID"] = raw_clinical_patient["PATIENT_ID"].replace()
    raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
    raw_clinical_patient = raw_clinical_patient.sort_index()  # sort_index() returns a new frame
    
    # Clinical sample info
    raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
    raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
    raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
    
    # Get the actual patient ID from the `raw_clinical_sample` dataframe
    # Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
    raw_clin = raw_clinical_patient.merge(raw_clinical_sample, on="PATIENT_ID", how="left").reset_index().drop(["PATIENT_ID"], axis=1)
    raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
    raw_clin.set_index('PATIENT_ID', inplace=True)
    

Now I want to drop from raw_clin all the columns that came from raw_clinical_sample, since the only columns I needed from it were PATIENT_ID and SAMPLE_ID.

    # Drop columns that exist in `raw_clinical_sample`
    raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]

Traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-60-45e2e83ddc00> in <module>()
     18 
     19 # Drop columns that exist in `raw_clinical_sample`
---> 20 raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in isin(self, values)
  10514         elif isinstance(values, DataFrame):
  10515             if not (values.columns.is_unique and values.index.is_unique):
> 10516                 raise ValueError("cannot compute isin with a duplicate axis.")
  10517             return self.eq(values.reindex_like(self))
  10518         else:

ValueError: cannot compute isin with a duplicate axis.

CodePudding user response:

There are several ways to do this. The key is to compare the column labels rather than the dataframes themselves.

For example using isin:

new_df1 = df1.loc[:, ~df1.columns.isin(df2.columns)]

or with drop:

new_df1 = df1.drop(columns=df1.columns.intersection(df2.columns))

example input:

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])

output:

pd.DataFrame(columns=['A', 'C', 'D'])
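
To connect this back to the error in the question, here is a minimal sketch with made-up data. The frames `sample` and `clin` and their column names are hypothetical stand-ins for raw_clinical_sample and raw_clin; the duplicated index label imitates a patient with more than one sample, which is an assumption about the real files. It shows that `DataFrame.isin` with a dataframe argument fails on a duplicated index, while filtering on column labels is unaffected:

```python
import pandas as pd

# Hypothetical stand-in for raw_clinical_sample: the index label "P1"
# appears twice, as it would for a patient with two samples.
sample = pd.DataFrame(
    {"CANCER_TYPE": ["GBM", "GBM", "GBM"], "ONCOTREE_CODE": ["GBM", "GBM", "GBM"]},
    index=pd.Index(["P1", "P1", "P2"], name="PATIENT_ID"),
)

# Hypothetical stand-in for raw_clin after the merge.
clin = pd.DataFrame(
    {
        "AGE": [50, 50, 61],
        "CANCER_TYPE": ["GBM", "GBM", "GBM"],
        "ONCOTREE_CODE": ["GBM", "GBM", "GBM"],
    }
)

# DataFrame.isin(DataFrame) compares cell values and requires unique axes,
# so the duplicated index on `sample` raises the ValueError from the question.
try:
    clin.isin(sample)
    isin_error = None
except ValueError as exc:
    isin_error = str(exc)

# Comparing column labels never looks at the row index, so it works here.
result = clin.loc[:, ~clin.columns.isin(sample.columns)]

print(isin_error)
print(list(result.columns))
```

With these stand-in frames only the `AGE` column remains, because it is the only label that does not occur in `sample.columns`.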

CodePudding user response:

You can use set operations to remove the rows of df1 that also appear in df2, like this:

df1 = pd.DataFrame()
df1['string'] = ['Hello', 'Hi', 'Hola']
df1['number'] = [1, 2, 3]
df2 = pd.DataFrame()
df2['string'] = ['Hello', 'Hola']
df2['number'] = [1, 5]

ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))

df_out = pd.DataFrame(list(ds1.difference(ds2)))
df_out.columns = df1.columns

print(df_out)

Output (row order may vary, since sets are unordered):

  string  number
0   Hola       3
1     Hi       2

Inspired by: https://stackoverflow.com/a/18184990/7509907

Edit:

Sorry, I didn't notice that you need to drop columns. For that, you can use the following (using mozway's dummy example):

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])

ds1 = set(df1.columns)
ds2 = set(df2.columns)

cols = ds1.difference(ds2)

df = df1[list(cols)]  # convert to a list: pandas 2.0+ rejects a set as an indexer

print(df)

Output (column order may vary, since sets are unordered):

Empty DataFrame
Columns: [C, A, D]
Index: []
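
If the original column order matters, one possible variant (a sketch, not part of the original answer; the names `exclude` and `keep` are just illustrative) uses the set only for the membership test and iterates over `df1.columns` so that the left-to-right order is kept:

```python
import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])

# The set is used only for fast membership checks.
exclude = set(df2.columns)

# Iterating over df1.columns preserves df1's column order.
keep = [col for col in df1.columns if col not in exclude]

df = df1[keep]
print(list(df.columns))
```

Selecting with a list also avoids the problem of indexing a dataframe with a set.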