How do I modify a pandas index using regex and then compare it to the index of another dataframe?

Time:06-02

I want to rename the indices of my pandas dataframe by retaining only the substring before the third hyphen. My code doesn't modify the indices. Why?

import re

for i in meth_450.index:
    re.sub(r"^[^-]*-[^-]*:[^-]*", "", i)

meth_450.index

Index(['TCGA-06-0125-01A-01D-A45W-05', 'TCGA-06-0125-02A-11D-2004-05',
       'TCGA-06-0152-01A-02D-A45W-05', 'TCGA-06-0152-02A-01D-2004-05',
       'TCGA-06-0171-01A-02D-A45W-05', 'TCGA-06-0171-02A-11D-2004-05',
       'TCGA-06-0190-01A-01D-A45W-05', 'TCGA-06-0190-02A-01D-2004-05',
       'TCGA-06-0210-01A-01D-A45W-05', 'TCGA-06-0210-02A-01D-2004-05'],
      dtype='object', length=155)

Desired output:

TCGA-06-0125, TCGA-06-0125,
TCGA-06-0152, TCGA-06-0152,
TCGA-06-0171, TCGA-06-0171,
TCGA-06-0190, TCGA-06-0190,
TCGA-06-0210, TCGA-06-0210

Ultimately, I want to match this dataframe to another dataframe:

clin = clin[clin.index.isin(meth_450.index)]

CodePudding user response:

import pandas as pd

index = pd.Index(['TCGA-06-0125-01A-01D-A45W-05', 'TCGA-06-0125-02A-11D-2004-05',
       'TCGA-06-0152-01A-02D-A45W-05', 'TCGA-06-0152-02A-01D-2004-05',
       'TCGA-06-0171-01A-02D-A45W-05', 'TCGA-06-0171-02A-11D-2004-05',
       'TCGA-06-0190-01A-01D-A45W-05', 'TCGA-06-0190-02A-01D-2004-05',
       'TCGA-06-0210-01A-01D-A45W-05', 'TCGA-06-0210-02A-01D-2004-05']
)

# You can slice by character count if the prefix always has the same length
index.str[:12]

# If you want to use regex, use .+? for a non-greedy match
index.str.extract("^(.+?-.+?-.+?)-")[0]

CodePudding user response:

Try this:

import re

for i in meth_450.index:
    re.sub(r"^\w*[-]\w*[-]\w*", "", i)

You have an error in your regex: it should be ^[^-]*-[^-]*-[^-]*, not ^[^-]*-[^-]*:[^-]* (the second separator is a colon instead of a hyphen).
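
Two more things to watch for with a loop like this: re.sub returns a new string (Python strings are immutable), so the result is lost unless it is stored, and replacing the matched prefix with "" keeps the tail rather than the prefix. A minimal sketch of both points, using a plain list as a stand-in for the index (the names barcodes, prefix, discarded and shortened are only for illustration):

```python
import re

# Sample barcodes taken from the question's index.
barcodes = [
    "TCGA-06-0125-01A-01D-A45W-05",
    "TCGA-06-0125-02A-11D-2004-05",
    "TCGA-06-0152-01A-02D-A45W-05",
]

# Three hyphen-separated fields anchored at the start of the string.
prefix = re.compile(r"^[^-]*-[^-]*-[^-]*")

# re.sub returns a new string; the original is untouched unless reassigned.
discarded = barcodes[0]
re.sub(prefix, "", discarded)
# `discarded` still holds the full barcode at this point.

# Substituting the match with "" would leave only the tail,
# so take the matched text itself to keep the prefix.
shortened = [prefix.match(b).group(0) for b in barcodes]
print(shortened)
```

The same idea carries over to the dataframe: build the new labels, then assign them to meth_450.index.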

CodePudding user response:

Try re.sub(r"-\w{3}-\w{3}-\w{4}-\d\d", "", i)

CodePudding user response:

Whichever method you use, don't forget to assign the result back to the index:

meth_450.index = meth_450.index.str.extract(r'^([^-]+-[^-]+-[^-]+)', expand=False)
print(meth_450.index)

# Output
Index(['TCGA-06-0125', 'TCGA-06-0125', 'TCGA-06-0152', 'TCGA-06-0152',
       'TCGA-06-0171', 'TCGA-06-0171', 'TCGA-06-0190', 'TCGA-06-0190',
       'TCGA-06-0210', 'TCGA-06-0210'],
      dtype='object')
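
Once the index has been shortened, the isin filter from the question should work as intended. Here is a rough end-to-end sketch, assuming clin is indexed by the three-field patient ID; the toy frames, the beta and age columns, and the TCGA-06-9999 row are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the question's frames; values and columns are invented.
meth_450 = pd.DataFrame(
    {"beta": [0.1, 0.2, 0.3, 0.4]},
    index=[
        "TCGA-06-0125-01A-01D-A45W-05",
        "TCGA-06-0125-02A-11D-2004-05",
        "TCGA-06-0152-01A-02D-A45W-05",
        "TCGA-06-0152-02A-01D-2004-05",
    ],
)
clin = pd.DataFrame(
    {"age": [63, 68, 65]},
    index=["TCGA-06-0125", "TCGA-06-0152", "TCGA-06-9999"],
)

# Keep the first three hyphen-separated fields and assign the result back.
meth_450.index = meth_450.index.str.extract(r"^([^-]+-[^-]+-[^-]+)", expand=False)

# The shortened index has duplicates (several samples per patient).
print(meth_450.index.is_unique)

# Keep only the clinical rows whose ID appears in the methylation index.
clin = clin[clin.index.isin(meth_450.index)]
print(clin.index.tolist())
```

Note that the shortened index is no longer unique, which is fine for isin but matters if you later join or reindex the two frames on it.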