I want to create a conditional column named type, where:
- If the last part of the index starts with 01 to 09, the label is tumor (for example, TCGA-06-0125-02A is tumor)
- Otherwise, the label is non-tumor (e.g., TCGA-06-0125-12A is non-tumor)
Code:
import numpy as np
import pandas as pd
# 01-09 : tumor
# 10-19 : normal
# Color the PCA plot by tumor vs non-tumor
condition = meth_450.loc[meth_450.index.str.contains('01') | meth_450.index.str.contains('02') | meth_450.index.str.contains('03') | meth_450.index.str.contains('04') | meth_450.index.str.contains('05') | meth_450.index.str.contains('06') | meth_450.index.str.contains('07') | meth_450.index.str.contains('08') | meth_450.index.str.contains('09') ]
label = "non-tumor"
meth_450["type"] = np.select(condition, label, default="tumor")
Traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-3f71a5ae2941> in <cell line: 8>()
6 condition = meth_450.loc[meth_450.index.str.contains('11A')]
7 label = "non-tumor"
----> 8 meth_450["type"] = np.select(condition, label, default="tumor")
/shared-libs/python3.10/py/lib/python3.10/site-packages/numpy/core/overrides.py in select(*args, **kwargs)
/shared-libs/python3.10/py/lib/python3.10/site-packages/numpy/lib/function_base.py in select(condlist, choicelist, default)
784 # Check the size of condlist and choicelist are the same, or abort.
785 if len(condlist) != len(choicelist):
--> 786 raise ValueError(
787 'list of cases must be same length as list of conditions')
788
ValueError: list of cases must be same length as list of conditions
Example meth_450 dataframe (as dictionary):
meth_450 = pd.DataFrame({'TCGA-06-0125-02A':[0.1, 0.2, 0.3], 'TCGA-06-0125-12A':[0.4, 0.5, 0.6], 'TCGA-06-0125-04A':[0.7, 0.8, 0.9]})
Expected output:
| | cg001 | cg002 | cg003 | type |
|---|---|---|---|---|
| TCGA-06-0125-02A | 0.1 | 0.2 | 0.3 | tumor |
| TCGA-06-0125-12A | 0.4 | 0.5 | 0.6 | non-tumor |
| TCGA-06-0125-04A | 0.7 | 0.8 | 0.9 | tumor |
CodePudding user response:
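You can use the pandas string methods with a regex to find the rows whose index has a last segment starting with 01 to 09. The example dataframe has the sample IDs as columns, so it is transposed first to move them into the index: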
meth_450 = pd.DataFrame({'TCGA-06-0125-02A':[0.1, 0.2, 0.3], 'TCGA-06-0125-12A':[0.4, 0.5, 0.6], 'TCGA-06-0125-04A':[0.7, 0.8, 0.9]})
meth_450 = meth_450.T.assign(type=(
meth_450.T.index
.str.split('-')
.str[-1]
.str.contains(r'^0[1-9]', regex=True)
)).replace({'type':{True:'tumor', False:'non-tumor'}})
Output:
0 1 2 type
TCGA-06-0125-02A 0.1 0.2 0.3 tumor
TCGA-06-0125-12A 0.4 0.5 0.6 non-tumor
TCGA-06-0125-04A 0.7 0.8 0.9 tumor
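To make the chain above easier to follow, here is a small sketch that evaluates each step separately on the same example data. The intermediate names (samples, last_part, is_tumor) are only illustrative and do not appear in the original answer.

```python
import pandas as pd

meth_450 = pd.DataFrame({
    'TCGA-06-0125-02A': [0.1, 0.2, 0.3],
    'TCGA-06-0125-12A': [0.4, 0.5, 0.6],
    'TCGA-06-0125-04A': [0.7, 0.8, 0.9],
})

# After transposing, the sample barcodes form the row index.
samples = meth_450.T.index

# Split each barcode on '-' and keep only the final segment, e.g. '02A'.
last_part = samples.str.split('-').str[-1]

# True where the final segment starts with 01 to 09.
is_tumor = last_part.str.contains(r'^0[1-9]', regex=True)

print(list(last_part))
print(list(is_tumor))
```

The anchor ^ matters here: without it, the pattern could match digits anywhere in the segment rather than only at its start.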
CodePudding user response:
Another option is to use np.where() to assign tumor or non-tumor:
import numpy as np
dft = meth_450.T
# Raw string avoids invalid-escape warnings; 0[1-9] excludes the code 00
dft['type'] = np.where(dft.index.str.match(r'\w{4}-\d\d-\d{4}-0[1-9]\w'),'tumor','non-tumor')
Result
0 1 2 type
TCGA-06-0125-02A 0.1 0.2 0.3 tumor
TCGA-06-0125-12A 0.4 0.5 0.6 non-tumor
TCGA-06-0125-04A 0.7 0.8 0.9 tumor
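As for the ValueError in the question: np.select expects a list of boolean arrays and a list of choices of the same length, whereas the original call passed a filtered dataframe and a bare string. A minimal sketch of how np.select could be used here, assuming the sample IDs have already been moved into the index; the names sample_code, tumor_codes, conditions and choices are hypothetical:

```python
import numpy as np
import pandas as pd

meth_450 = pd.DataFrame({
    'TCGA-06-0125-02A': [0.1, 0.2, 0.3],
    'TCGA-06-0125-12A': [0.4, 0.5, 0.6],
    'TCGA-06-0125-04A': [0.7, 0.8, 0.9],
})
dft = meth_450.T

# First two characters of the last barcode segment, e.g. '02' from '02A'.
sample_code = dft.index.str.split('-').str[-1].str[:2]

# Codes '01' through '09' are treated as tumor.
tumor_codes = [f'{i:02d}' for i in range(1, 10)]

# One boolean array per condition, and one choice per condition.
conditions = [sample_code.isin(tumor_codes)]
choices = ['tumor']

dft['type'] = np.select(conditions, choices, default='non-tumor')
print(dft)
```

With only two outcomes, np.where is the simpler tool; np.select becomes more useful if further categories are added, each as another entry in both lists.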