I want to read into three omics datasets - methylation, cnv, and mRNA expression.
First, I removed duplicated columns and sort by index for all 3 dataframes mrna
, meth
, and cna
separately. These dataframes are then concatenated to form one dataframe common
.
Eventually, I want the common
dataframe to only contain "Hugo_Symbol" column that are the same in all three dataframes mrna
, meth
, and cna
.
import re
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
Here, the common
dataframe has a 48625 rows × 348 columns
dimension.
Now, I only want to keep the rows in common
if the row value is the same across all the three original dataframes mrna
, cna
, and meth
.
keep_col = common.drop_duplicates(keep="first")["Hugo_Symbol"]
common = common[common.set_index(["Hugo_Symbol"]).index.isin(keep_col)]
I'm expecting the newly subsetted common
dataframe to have # of rows equal to the length of keep_col
len(keep_col)
48618
However, the subsetted common
dataframe instead has a 48625 rows × 348 columns
dimension, meaning that none of the rows were removed/subsetted.
Example dataframes:
mrna
Hugo_Symbol | TCGA-1 | TCGA-2 | TCGA-3 | |
---|---|---|---|---|
0 | ABC | 123 | 456 | 789 |
1 | DEF | 678 | 187 | 456 |
2 | MNO | 742 | 147 | 147 |
3 | VWX | 184 | 195 | 268 |
cna
Hugo_Symbol | TCGA-1 | TCGA-2 | TCGA-3 | |
---|---|---|---|---|
0 | DEF | 456 | 123 | 321 |
1 | ABC | 123 | 456 | 789 |
2 | GHI | 694 | 284 | 270 |
3 | JKL | 384 | 843 | 147 |
meth
Hugo_Symbol | TCGA-1 | TCGA-2 | TCGA-3 | |
---|---|---|---|---|
0 | DEF | 456 | 123 | 432 |
1 | PQR | 285 | 226 | 732 |
2 | STU | 225 | 632 | 532 |
3 | ABC | 123 | 456 | 789 |
In this example, the rows with Hugo_Symbol ABC
and DEF
are retained because all three dataframes have these Hugo_Symbols. Hence, the common
dataframe, as (previously) sorted by index, would be:
common
Hugo_Symbol | TCGA-1 | TCGA-2 | TCGA-3 | |
---|---|---|---|---|
0 | ABC | 123 | 456 | 789 |
1 | DEF | 456 | 123 | 321 |
CodePudding user response:
What exactly would the ideal output look like, is this what you're trying to do? I feel like there's some confusion between columns and rows...
dfs = [mrna, meth]
def do_stuff(df):
df = pd.DataFrame(df).T
df.columns = df.iloc[0]
df = df[1:]
return df
mrna, meth = map(do_stuff, dfs)
print(pd.concat([mrna, meth], axis=1).dropna())
Output:
Hugo_Symbol AACS FSTL1 ELMO2 CREB3L1 RPS11 PNMA1 MMP2 \
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351 0.2216 0.6798 -2.48 0.7912
TCGA-02-0003-01 -0.5154 0.0896 -1.17 0.1255 0.2374 -3.2629 1.2846
TCGA-02-0007-01 0.6932 -5.6511 -2.8365 2.0026 -0.6326 -1.3741 -3.437
Hugo_Symbol SAMD4A SMARCD3 A4GNT MEOX2 COX8C IMPA2 \
TCGA-02-0001-01 -1.4578 -3.8009 3.4868 0.041451 0.981055 0.01858
TCGA-02-0003-01 -1.474 -2.9891 -0.1511 0.10389 0.989163 0.127843
TCGA-02-0007-01 -1.047 -4.185 2.1816 0.025519 0.985666 0.018484
Hugo_Symbol TTC8 TMEM186 RETSAT TULP1 TAF15 CCDC88B \
TCGA-02-0001-01 0.037853 0.858835 0.028956 0.895705 0.024163 0.64503
TCGA-02-0003-01 0.061488 0.718749 0.037052 0.875983 0.02025 0.338662
TCGA-02-0007-01 0.084045 0.862945 0.029969 0.930821 0.02238 0.319357
Hugo_Symbol EPB41L3
TCGA-02-0001-01 0.025546
TCGA-02-0003-01 0.879701
TCGA-02-0007-01 0.019343
CodePudding user response:
Given that the entire row must be equal, as by my first answer below returns:
import pandas as pd
mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}
cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}
meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}
dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])\
for df in dfs]), columns=dfs[0].columns)
print(f"common: \n{common}")
"""
common:
Hugo_Symbol TCGA-1 TCGA-2 TCGA-3
0 ABC 123 456 789
"""
Given that only 'Hugo_Symbol' column is relevant, below returns: import pandas as pd
mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}
cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}
meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}
dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
hugoFilter = set.intersection(*[set(df.loc[:, 'Hugo_Symbol'].values.tolist()) for df in dfs])
common = pd.concat(dfs)
common = common[common.Hugo_Symbol.isin(hugoFilter)].drop_duplicates(keep="first", subset='Hugo_Symbol')
print(f"common: \n{common}")
"""
common:
Hugo_Symbol TCGA-1 TCGA-2 TCGA-3
0 ABC 123 456 789
1 DEF 678 187 456
NOTE: make sure the DEF from the corrrect DF is returned, values differ.
"""