How to assign column value based on index substring?

If the last segment of the index (for example "01A") starts with a code from 01 to 09, label the row as "tumor" in the "type" column.
If it starts with a code from 10 to 19, label the row as "normal" in the "type" column. The columns are a MultiIndex. How do I assign the "type" column accordingly?

# 01-09 : tumor
# 10-19 : normal

meth_450_5_kipan = meth_450_5_kipan.assign(type=(
    meth_450_5_kipan.index
    .str.split('-')
    .str[-1]
    .str.contains(r'0[1-9]', regex=True)
)).replace({'type':{True:"tumor", False:"normal"}}).dropna(axis=1)

Current output:

meth_450_5_kipan.iloc[-5:,-5:]

pd.DataFrame({('cg09560763', 'PRDM8'): {'TCGA-Y8-A898-01A': 0.505822845732314,
  'TCGA-Y8-A8RY-01A': 0.494413413161009,
  'TCGA-Y8-A8RZ-11A': 0.301740989582562,
  'TCGA-Y8-A8S0-01A': 0.235758339404136,
  'TCGA-Y8-A8S1-01A': 0.731030638674928},
 ('cg09560811', nan): {'TCGA-Y8-A898-01A': 0.933102042099432,
  'TCGA-Y8-A8RY-01A': 0.9097565027488,
  'TCGA-Y8-A8RZ-11A': 0.920238344141844,
  'TCGA-Y8-A8S0-01A': 0.924803871437567,
  'TCGA-Y8-A8S1-01A': 0.929761655129724},
 ('cg09560911', 'TNFRSF21'): {'TCGA-Y8-A898-01A': 0.0262547882636862,
  'TCGA-Y8-A8RY-01A': 0.031638387180189,
  'TCGA-Y8-A8RZ-11A': 0.0304795189432937,
  'TCGA-Y8-A8S0-01A': 0.0255867247450433,
  'TCGA-Y8-A8S1-01A': 0.0234602952079715},
 ('cg09560953', 'UBE2E1'): {'TCGA-Y8-A898-01A': 0.901422948355672,
  'TCGA-Y8-A8RY-01A': 0.851164164393655,
  'TCGA-Y8-A8RZ-11A': 0.707673764192998,
  'TCGA-Y8-A8S0-01A': 0.721923173082175,
  'TCGA-Y8-A8S1-01A': 0.835676721188431},
 ('type', ''): {'TCGA-Y8-A898-01A': True,
  'TCGA-Y8-A8RY-01A': True,
  'TCGA-Y8-A8RZ-11A': False,
  'TCGA-Y8-A8S0-01A': True,
  'TCGA-Y8-A8S1-01A': True}})

Expected output:

pd.DataFrame({('cg09560763', 'PRDM8'): {'TCGA-Y8-A898-01A': 0.505822845732314,
  'TCGA-Y8-A8RY-01A': 0.494413413161009,
  'TCGA-Y8-A8RZ-11A': 0.301740989582562,
  'TCGA-Y8-A8S0-01A': 0.235758339404136,
  'TCGA-Y8-A8S1-01A': 0.731030638674928},
 ('cg09560811', nan): {'TCGA-Y8-A898-01A': 0.933102042099432,
  'TCGA-Y8-A8RY-01A': 0.9097565027488,
  'TCGA-Y8-A8RZ-11A': 0.920238344141844,
  'TCGA-Y8-A8S0-01A': 0.924803871437567,
  'TCGA-Y8-A8S1-01A': 0.929761655129724},
 ('cg09560911', 'TNFRSF21'): {'TCGA-Y8-A898-01A': 0.0262547882636862,
  'TCGA-Y8-A8RY-01A': 0.031638387180189,
  'TCGA-Y8-A8RZ-11A': 0.0304795189432937,
  'TCGA-Y8-A8S0-01A': 0.0255867247450433,
  'TCGA-Y8-A8S1-01A': 0.0234602952079715},
 ('cg09560953', 'UBE2E1'): {'TCGA-Y8-A898-01A': 0.901422948355672,
  'TCGA-Y8-A8RY-01A': 0.851164164393655,
  'TCGA-Y8-A8RZ-11A': 0.707673764192998,
  'TCGA-Y8-A8S0-01A': 0.721923173082175,
  'TCGA-Y8-A8S1-01A': 0.835676721188431},
 ('type', ''): {'TCGA-Y8-A898-01A': 'tumor',
  'TCGA-Y8-A8RY-01A': 'tumor',
  'TCGA-Y8-A8RZ-11A': 'normal',
  'TCGA-Y8-A8S0-01A': 'tumor',
  'TCGA-Y8-A8S1-01A': 'tumor'}})

CodePudding user response:

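Your code runs, but it leaves the "type" column as True/False. .str.contains() returns a boolean array, and assign() stores it under the column label ('type', '') because the columns are a MultiIndex. The nested dict passed to .replace() is keyed by the plain string 'type', which most likely does not match that tuple label, so no values are replaced.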
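One detail worth checking is the label that assign() gives the new column when the columns are a MultiIndex. The following is a minimal sketch on a toy frame (the name df and its values are made up for illustration, not taken from your data):

```python
import pandas as pd

# Toy frame with two-level columns, standing in for meth_450_5_kipan.
df = pd.DataFrame(
    {("cg09560763", "PRDM8"): [0.51, 0.49, 0.30]},
    index=["TCGA-Y8-A898-01A", "TCGA-Y8-A8RY-01A", "TCGA-Y8-A8RZ-11A"],
)

# Same boolean mask as in the question: True where the last barcode
# segment contains a code from 01 to 09.
mask = df.index.str.split("-").str[-1].str.contains(r"0[1-9]", regex=True)

out = df.assign(type=mask)

# With MultiIndex columns, the new label is padded to a full tuple.
print(out.columns.tolist())
# [('cg09560763', 'PRDM8'), ('type', '')]

# Addressing the column by its full tuple label makes the mapping explicit.
out[("type", "")] = out[("type", "")].map({True: "tumor", False: "normal"})
print(out[("type", "")].tolist())
# ['tumor', 'tumor', 'normal']
```

Using the full tuple ('type', '') avoids any ambiguity about which column a plain 'type' key refers to.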

Here is one possible approach that will accomplish what you're trying to do:

meth_450_5_kipan['type'] = meth_450_5_kipan.index.str.split('-').str[-1].str.contains(r'0[1-9]', regex=True)
meth_450_5_kipan['type'] = meth_450_5_kipan['type'].replace({True: "tumor", False: "normal"})
meth_450_5_kipan = meth_450_5_kipan.dropna(axis=1)

This code creates a new "type" column from the boolean result of .str.contains() on the index, then selects that column and replaces True with "tumor" and False with "normal" using .replace(). Finally, dropna(axis=1) drops every column that contains at least one NaN value.
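Note that the pattern 0[1-9] only tests for the tumor range, so every other code (not just 10-19) ends up labeled "normal". If that matters, a possible alternative is to extract the two-digit code and bin it explicitly. This is only a sketch on made-up data; the names df, codes and labels are illustrative, and it assumes the index ends with two digits followed by an optional letter:

```python
import pandas as pd

# Small stand-in for the real data: two-level columns, TCGA barcodes as index.
df = pd.DataFrame(
    {
        ("cg09560763", "PRDM8"): [0.51, 0.49, 0.30],
        ("cg09560911", "TNFRSF21"): [0.026, 0.032, 0.030],
    },
    index=["TCGA-Y8-A898-01A", "TCGA-Y8-A8RY-01A", "TCGA-Y8-A8RZ-11A"],
)

# Pull out the two-digit sample code that precedes the optional trailing letter.
codes = df.index.to_series().str.extract(r"-(\d{2})[A-Z]?$", expand=False)
codes = pd.to_numeric(codes)

# Bins are right-inclusive: (0, 9] -> tumor, (9, 19] -> normal.
# Codes outside 01-19 become NaN instead of being forced into a label.
labels = pd.cut(codes, bins=[0, 9, 19], labels=["tumor", "normal"])

# Assign with the full tuple so the MultiIndex column label is explicit.
df[("type", "")] = labels.astype(object).to_numpy()

print(df[("type", "")].tolist())
# ['tumor', 'tumor', 'normal']
```

With this version, rows whose code falls outside 01-19 stay NaN in "type", so they can be inspected or dropped separately.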