I merged 3 dataframes, mrna, meth, and cna. I want to remove any duplicate rows that either have the same Hugo_Symbol column value or have the same values across all the remaining columns (i.e., the columns starting with "TCGA-"). I also want to remove any rows that are identical across all the columns except for one column.
import re
import pandas as pd
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
common["Hugo_Symbol"] = [re.sub(r'\|. ', "", str(i)) for i in common["Hugo_Symbol"]] # In Hugo_Symbol column, remove everything after the pipe except newline
tcga_cols = common.columns[1:]
mask = common.duplicated("Hugo_Symbol") | common.duplicated(tcga_cols)
common = common[~mask]
common
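For reference, a quick sanity check of the pipe-stripping step, assuming Hugo_Symbol values follow the usual "SYMBOL|ID" pattern (the example value below is made up):
import re
print(re.sub(r'\|.*', "", "TP53|7157"))  # hypothetical value -> "TP53"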
A snippet of the dataframe:
common_dict = common.iloc[1:10,1:10].to_dict()
common_dict
{'TCGA-02-0001-01': {1: -0.9099,
2: -2.3351,
3: 0.2216,
4: 0.6798,
5: -2.48,
6: 0.7912,
7: -1.4578,
8: -3.8009,
9: 3.4868},
'TCGA-02-0003-01': {1: 0.0896,
2: -1.17,
3: 0.1255,
4: 0.2374,
5: -3.2629,
6: 1.2846,
7: -1.474,
8: -2.9891,
9: -0.1511},
'TCGA-02-0007-01': {1: -5.6511,
2: -2.8365,
3: 2.0026,
4: -0.6326,
5: -1.3741,
6: -3.437,
7: -1.047,
8: -4.185,
9: 2.1816},
'TCGA-02-0009-01': {1: 0.9795,
2: -0.5464,
3: 1.1115,
4: -0.2128,
5: -3.3461,
6: 1.3576,
7: -1.0782,
8: -3.4734,
9: -0.8985},
'TCGA-02-0010-01': {1: -0.7122,
2: 0.7651,
3: 2.4691,
4: 0.7222,
5: -1.7822,
6: -3.3403,
7: -1.6397,
8: 0.3424,
9: 1.7337},
'TCGA-02-0011-01': {1: -6.8649,
2: -0.4178,
3: 0.1858,
4: -0.0863,
5: -2.9486,
6: -3.843,
7: -0.9275,
8: -5.0462,
9: 0.9702},
'TCGA-02-0014-01': {1: -1.9439,
2: 0.3727,
3: -0.5368,
4: -0.1501,
5: 0.8977,
6: 0.5138,
7: -1.688,
8: 0.1778,
9: 1.7975},
'TCGA-02-0021-01': {1: -0.8761,
2: -0.2532,
3: 2.0574,
4: -0.9708,
5: -1.0883,
6: -1.0698,
7: -0.8684,
8: -5.3854,
9: 1.2353},
'TCGA-02-0024-01': {1: 1.6237,
2: -0.717,
3: -0.4517,
4: -0.5276,
5: -2.3993,
6: -4.3485,
7: 0.0811,
8: -2.5217,
9: 0.1883}}
Now, I want to remove any rows that are identical across all the columns except for one column.
Hugo_Symbol | TCGA-1 | TCGA-2 | TCGA-3 |
---|---|---|---|
First | 0.123 | 0.234 | 0.345 |
Second | 0.123 | 0.234 | 0.478 |
Third | 0.456 | 0.678 | 0.789 |
Fourth | 0.789 | 0.456 | 0.321 |
In this example dataframe, the First and Second rows have duplicate values in all but one of the "TCGA-" columns. This is the type of row that I want to remove.
CodePudding user response:
IIUC, you can compute the duplicates per column using apply(pd.Series.duplicated), then count the True values per row and compare that count to the desired threshold:
ncols = df.shape[1] - 1  # number of "TCGA-" columns (everything except Hugo_Symbol)
ndups = df.drop(columns='Hugo_Symbol').apply(pd.Series.duplicated).sum(axis=1)  # per row: how many values duplicate an earlier row in the same column
df2 = df[ndups.lt(ncols - 1)]  # keep rows with fewer than ncols-1 duplicated values
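For a self-contained run, the example table from the question can be rebuilt as df (a sketch; the name df just matches the snippet above):
import pandas as pd
df = pd.DataFrame({
    "Hugo_Symbol": ["First", "Second", "Third", "Fourth"],
    "TCGA-1": [0.123, 0.123, 0.456, 0.789],
    "TCGA-2": [0.234, 0.234, 0.678, 0.456],
    "TCGA-3": [0.345, 0.478, 0.789, 0.321],
})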
output (using the simple provided example):
Hugo_Symbol TCGA-1 TCGA-2 TCGA-3
0 First 0.123 0.234 0.345
2 Third 0.456 0.678 0.789
3 Fourth 0.789 0.456 0.321
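Applied to the merged frame from the question (a sketch, assuming it is named common and its first column is Hugo_Symbol, as above), the same idea would be:
ncols = common.shape[1] - 1  # number of "TCGA-" sample columns
ndups = common.drop(columns="Hugo_Symbol").apply(pd.Series.duplicated).sum(axis=1)
common = common[ndups.lt(ncols - 1)]  # drop rows where ncols-1 or more values repeat earlier rows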
There is, however, one potential blind spot. Imagine this dataset:
A B C D
X X C D
A B X X
The first row won't be dropped: it comes first (pd.Series.duplicated only flags later occurrences), and its duplicates are spread over several other rows (that might not be an issue in your case).
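A minimal sketch of that situation, with a hypothetical Hugo_Symbol column added so the snippet above applies unchanged:
import pandas as pd
toy = pd.DataFrame({
    "Hugo_Symbol": ["g1", "g2", "g3"],  # made-up labels
    "TCGA-1": ["A", "X", "A"],
    "TCGA-2": ["B", "X", "B"],
    "TCGA-3": ["C", "C", "X"],
    "TCGA-4": ["D", "D", "X"],
})
ndups = toy.drop(columns="Hugo_Symbol").apply(pd.Series.duplicated).sum(axis=1)
print(ndups.tolist())  # [0, 2, 2]: the first row is never flagged, since duplicated only marks later occurrences
Here no row reaches the ncols-1 == 3 threshold, so nothing is dropped.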