How do I remove duplicate rows based on all columns?

Time:04-13

I merged three dataframes (mrna, meth, and cna). I want to remove any duplicate rows that either share the same Hugo_Symbol value or have identical values across all the remaining columns (i.e., the columns starting with "TCGA-").

import re
import pandas as pd


dfs = [mrna, meth, cna]

common = pd.concat(dfs, join='inner')
common["Hugo_Symbol"] = [re.sub(r'\|. ', "", str(i)) for i in common["Hugo_Symbol"]]  # In Hugo_Symbol column, remove everything after the pipe except newline
common = common.drop_duplicates(subset="Hugo_Symbol") # Remove row if Hugo_Symbol is the same
common
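(As a side note, the same cleanup can be done with pandas' vectorized string methods instead of a list comprehension; a sketch, assuming the Hugo_Symbol values look like "GENE|12345":)

# Keep only the part before the first "|" in each Hugo_Symbol value
common["Hugo_Symbol"] = common["Hugo_Symbol"].astype(str).str.split("|").str[0]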

A snippet of the dataframe:

common_dict = common.iloc[1:10,1:10].to_dict()
common_dict

{'TCGA-02-0001-01': {1: -0.9099,
  2: -2.3351,
  3: 0.2216,
  4: 0.6798,
  5: -2.48,
  6: 0.7912,
  7: -1.4578,
  8: -3.8009,
  9: 3.4868},
 'TCGA-02-0003-01': {1: 0.0896,
  2: -1.17,
  3: 0.1255,
  4: 0.2374,
  5: -3.2629,
  6: 1.2846,
  7: -1.474,
  8: -2.9891,
  9: -0.1511},
 'TCGA-02-0007-01': {1: -5.6511,
  2: -2.8365,
  3: 2.0026,
  4: -0.6326,
  5: -1.3741,
  6: -3.437,
  7: -1.047,
  8: -4.185,
  9: 2.1816},
 'TCGA-02-0009-01': {1: 0.9795,
  2: -0.5464,
  3: 1.1115,
  4: -0.2128,
  5: -3.3461,
  6: 1.3576,
  7: -1.0782,
  8: -3.4734,
  9: -0.8985},
 'TCGA-02-0010-01': {1: -0.7122,
  2: 0.7651,
  3: 2.4691,
  4: 0.7222,
  5: -1.7822,
  6: -3.3403,
  7: -1.6397,
  8: 0.3424,
  9: 1.7337},
 'TCGA-02-0011-01': {1: -6.8649,
  2: -0.4178,
  3: 0.1858,
  4: -0.0863,
  5: -2.9486,
  6: -3.843,
  7: -0.9275,
  8: -5.0462,
  9: 0.9702},
 'TCGA-02-0014-01': {1: -1.9439,
  2: 0.3727,
  3: -0.5368,
  4: -0.1501,
  5: 0.8977,
  6: 0.5138,
  7: -1.688,
  8: 0.1778,
  9: 1.7975},
 'TCGA-02-0021-01': {1: -0.8761,
  2: -0.2532,
  3: 2.0574,
  4: -0.9708,
  5: -1.0883,
  6: -1.0698,
  7: -0.8684,
  8: -5.3854,
  9: 1.2353},
 'TCGA-02-0024-01': {1: 1.6237,
  2: -0.717,
  3: -0.4517,
  4: -0.5276,
  5: -2.3993,
  6: -4.3485,
  7: 0.0811,
  8: -2.5217,
  9: 0.1883}}

Now, I want to drop any duplicate rows by subsetting all the columns beginning with "TCGA-" (i.e., all except the Hugo_Symbol column). How do I do it?

common = common.drop_duplicates(subset=common.columns[1:], keep="first")  # subset needs column labels (everything except Hugo_Symbol), not a slice like [1:,]

CodePudding user response:

Here is example data to reproduce the problem. The OP's dictionary data needed a few changes so that it actually contains duplicates.

 df = pd.DataFrame({
    'Hugo_Symbol': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'ABC', 'GHI', 'XYZ', 'DEF', 'BBB', 'CCC'], 
    'TCGA-02-0001-01': [-0.9099, -2.3351, 0.2216, 0.6798, -2.48, 0.7912, -1.4578, -3.8009, 3.4868, -2.48, 3.4868], 
    'TCGA-02-0003-01': [0.0896, -1.17, 0.1255, 0.2374, -3.2629, 1.2846, -1.474, -2.9891, -0.1511, -3.2629, -0.1511], 
    'TCGA-02-0007-01': [-5.6511, -2.8365, 2.0026, -0.6326, -1.3741, -3.437, -1.047, -4.185, 2.1816, -1.3741, 2.1816], 
    'TCGA-02-0009-01': [0.9795, -0.5464, 1.1115, -0.2128, -3.3461, 1.3576, -1.0782, -3.4734, -0.8985, -3.3461, -0.8985], 
    'TCGA-02-0010-01': [-0.7122, 0.7651, 2.4691, 0.7222, -1.7822, -3.3403, -1.6397, 0.3424, 1.7337, -1.7822, 1.7337], 
    'TCGA-02-0011-01': [-6.8649, -0.4178, 0.1858, -0.0863, -2.9486, -3.843, -0.9275, -5.0462, 0.9702, -2.9486, 0.9702], 
    'TCGA-02-0014-01': [-1.9439, 0.3727, -0.5368, -0.1501, 0.8977, 0.5138, -1.688, 0.1778, 1.7975, 0.8977, 1.7975], 
    'TCGA-02-0021-01': [-0.8761, -0.2532, 2.0574, -0.9708, -1.0883, -1.0698, -0.8684, -5.3854, 1.2353, -1.0883, 1.2353], 
    'TCGA-02-0024-01': [1.6237, -0.717, -0.4517, -0.5276, -2.3993, -4.3485, 0.0811, -2.5217, 0.1883, -2.3993, 0.1883]})

We have some duplicates in the "Hugo_Symbol" column, and the last two rows ("BBB" and "CCC", with different Hugo symbols) contain exactly the same data as the rows at index 4 ("MNO") and index 8 ("DEF").
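To double-check which rows are involved before dropping anything, here is a quick sketch (duplicated with keep=False marks every occurrence, not just the later ones):

tcga_like_cols = [c for c in df.columns if c.startswith("TCGA-")]

# Rows that share a Hugo_Symbol with at least one other row
print(df.loc[df.duplicated("Hugo_Symbol", keep=False), "Hugo_Symbol"].tolist())

# Rows that have identical values across every TCGA- column
print(df.loc[df.duplicated(tcga_like_cols, keep=False), "Hugo_Symbol"].tolist())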

Following the idea from @Code Different, I created a boolean mask and used it to filter the DataFrame.

tcga_cols = df.columns[df.columns.str.startswith("TCGA-")].to_list()
mask = df.duplicated("Hugo_Symbol") | df.duplicated(tcga_cols)

print(mask)
0     False
1     False
2     False
3     False
4     False
5      True
6      True
7     False
8      True
9      True
10     True
dtype: bool

result = df[~mask]
print(result)

  Hugo_Symbol  TCGA-02-0001-01  TCGA-02-0003-01  TCGA-02-0007-01  TCGA-02-0009-01  TCGA-02-0010-01  TCGA-02-0011-01  TCGA-02-0014-01  TCGA-02-0021-01  TCGA-02-0024-01
0         ABC          -0.9099           0.0896          -5.6511           0.9795          -0.7122          -6.8649          -1.9439          -0.8761           1.6237
1         DEF          -2.3351          -1.1700          -2.8365          -0.5464           0.7651          -0.4178           0.3727          -0.2532          -0.7170
2         GHI           0.2216           0.1255           2.0026           1.1115           2.4691           0.1858          -0.5368           2.0574          -0.4517
3         JKL           0.6798           0.2374          -0.6326          -0.2128           0.7222          -0.0863          -0.1501          -0.9708          -0.5276
4         MNO          -2.4800          -3.2629          -1.3741          -3.3461          -1.7822          -2.9486           0.8977          -1.0883          -2.3993
7         XYZ          -3.8009          -2.9891          -4.1850          -3.4734           0.3424          -5.0462           0.1778          -5.3854          -2.5217

As you can see, the result only contains rows where the mask was False.
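If you want to inspect what was removed, the inverse selection works too (a sketch):

dropped = df[mask]                         # the rows flagged as duplicates
print(dropped["Hugo_Symbol"].tolist())     # ['ABC', 'GHI', 'DEF', 'BBB', 'CCC']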

EDIT: I tested the logic on several cases and it works fine for this example data, so I suspect your real data has some formatting quirk that causes the problem.

For example, if your column names have leading whitespace, str.startswith won't match them. As a workaround: do ALL of your columns except "Hugo_Symbol" start with "TCGA-"? If so, you can simply replace the tcga_cols line with:

tcga_cols = df.columns[1:]
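Alternatively, if stray whitespace in the column names is the problem, you could normalize them first so that str.startswith matches again (a sketch):

df.columns = df.columns.str.strip()   # remove leading/trailing whitespace from the column names
tcga_cols = df.columns[df.columns.str.startswith("TCGA-")].to_list()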