How to create a conditional column in a pandas DataFrame using regex?

I want to create a conditional column named type where:

If the last part of the index starts with 01 to 09, the label is tumor (for example, TCGA-06-0125-02A is tumor)
Otherwise, label as non-tumor (e.g., TCGA-06-0125-12A is non-tumor)

Code:

import numpy as np
import pandas as pd

# 01-09 : tumor
# 10-19 : normal

# Color the PCA plot by tumor vs non-tumor 
condition = meth_450.loc[meth_450.index.contains('01') | meth_450.index.str.contains('02') | meth_450.index.str.contains('03') | meth_450.index.str.contains('04') | meth_450.index.str.contains('05') | meth_450.index.str.contains('06') | meth_450.index.str.contains('07')  | meth_450.index.str.contains('08') | meth_450.index.str.contains('09') ] 
label = "non-tumor"
meth_450["type"] = np.select(condition, label, default="tumor")

Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-3f71a5ae2941> in <cell line: 8>()
      6 condition = meth_450.loc[meth_450.index.str.contains('11A')]
      7 label = "non-tumor"
----> 8 meth_450["type"] = np.select(condition, label, default="tumor")

/shared-libs/python3.10/py/lib/python3.10/site-packages/numpy/core/overrides.py in select(*args, **kwargs)

/shared-libs/python3.10/py/lib/python3.10/site-packages/numpy/lib/function_base.py in select(condlist, choicelist, default)
    784     # Check the size of condlist and choicelist are the same, or abort.
    785     if len(condlist) != len(choicelist):
--> 786         raise ValueError(
    787             'list of cases must be same length as list of conditions')
    788 

ValueError: list of cases must be same length as list of conditions

Example meth_450 dataframe (as dictionary):

meth_450 = pd.DataFrame({'TCGA-06-0125-02A':[0.1, 0.2, 0.3], 'TCGA-06-0125-12A':[0.4, 0.5, 0.6], 'TCGA-06-0125-04A':[0.7, 0.8, 0.9]})

Expected output:

	cg001	cg002	cg003	type
TCGA-06-0125-02A	0.1	0.2	0.3	tumor
TCGA-06-0125-12A	0.4	0.5	0.6	non-tumor
TCGA-06-0125-04A	0.7	0.8	0.9	tumor

CodePudding user response:

You can use string methods with regex to find the rows whose index has a last portion that starts with 01-09:

import pandas as pd

meth_450 = pd.DataFrame({'TCGA-06-0125-02A':[0.1, 0.2, 0.3], 'TCGA-06-0125-12A':[0.4, 0.5, 0.6], 'TCGA-06-0125-04A':[0.7, 0.8, 0.9]})
# Transpose so the sample IDs become the index, then test the last '-' field
meth_450 = meth_450.T.assign(type=(
    meth_450.T.index
    .str.split('-')
    .str[-1]
    .str.contains(r'^0[1-9]', regex=True)  # True when the field starts with 01-09
)).replace({'type':{True:'tumor', False:'non-tumor'}})

Output:

                    0    1    2       type
TCGA-06-0125-02A  0.1  0.2  0.3      tumor
TCGA-06-0125-12A  0.4  0.5  0.6  non-tumor
TCGA-06-0125-04A  0.7  0.8  0.9      tumor
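
As a side note on the traceback in the question: np.select expects a list of boolean conditions and an equally long list of choices, whereas the question passes a filtered DataFrame and a bare string, so the length check fails. A minimal sketch of a corrected np.select call, assuming the frame is transposed so that the sample IDs are the index (the names samples and is_tumor are only illustrative, not from the question):

```python
import numpy as np
import pandas as pd

meth_450 = pd.DataFrame({
    'TCGA-06-0125-02A': [0.1, 0.2, 0.3],
    'TCGA-06-0125-12A': [0.4, 0.5, 0.6],
    'TCGA-06-0125-04A': [0.7, 0.8, 0.9],
})
samples = meth_450.T  # sample IDs become the index (illustrative name)

# Boolean mask: the last '-' separated field starts with 01-09
last_field = samples.index.str.split('-').str[-1]
is_tumor = np.asarray(last_field.str.match(r'0[1-9]'), dtype=bool)

# One condition, one choice: both arguments are lists of the same length
samples['type'] = np.select([is_tumor], ['tumor'], default='non-tumor')
print(samples)
```

With a single condition this is equivalent to np.where; np.select mainly pays off once there are several labels to assign.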

CodePudding user response:

Another option is to use np.where() to assign tumor or non-tumor:

import numpy as np

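dft = meth_450.T  # transpose so the sample IDs become the index
# Raw string avoids invalid-escape warnings; 0[1-9] excludes 00, which is outside 01-09
dft['type'] = np.where(dft.index.str.match(r'\w{4}-\d\d-\d{4}-0[1-9]\w'), 'tumor', 'non-tumor')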

Result

                    0    1    2       type
TCGA-06-0125-02A  0.1  0.2  0.3      tumor
TCGA-06-0125-12A  0.4  0.5  0.6  non-tumor
TCGA-06-0125-04A  0.7  0.8  0.9      tumor
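
If the numeric ranges from the question's comments (01-09 tumor, 10-19 normal) matter more than the pattern itself, one possible variant is to extract the two-digit code and compare it as an integer. This is only a sketch: it assumes every index value ends in a field that begins with two digits (otherwise the integer conversion would fail), and the names code and dft['type'] are chosen for illustration:

```python
import pandas as pd

meth_450 = pd.DataFrame({
    'TCGA-06-0125-02A': [0.1, 0.2, 0.3],
    'TCGA-06-0125-12A': [0.4, 0.5, 0.6],
    'TCGA-06-0125-04A': [0.7, 0.8, 0.9],
})
dft = meth_450.T  # sample IDs become the index

# Capture the two digits that open the last field, e.g. "02" in "...-02A"
code = dft.index.str.extract(r'-(\d{2})\w*$', expand=False).astype(int)

# 1-9 is labelled tumor; everything else is labelled non-tumor
dft['type'] = ['tumor' if 1 <= c <= 9 else 'non-tumor' for c in code]
print(dft)
```

Working with the integer code makes it straightforward to add further ranges later without rewriting the regex.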