Python Pandas .str.extract method fails when indexing

I'd like to set values on a slice of a DataFrame with .loc, using the pandas .str.extract() method, but it fails with indexing errors. The same code works perfectly if I swap extract for contains.

Here is a sample frame:

import pandas as pd

df = pd.DataFrame(
    {
        'name': [
            'JUNK-0003426', 'TEST-0003435', 'JUNK-0003432', 'TEST-0003433', 'TEST-0003436',
        ], 
        'value': [
            'Junk', 'None', 'Junk', 'None', 'None',
        ]
    }
)

Here is my code:

df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)")

How can I set the 'None' values to the string extracted by the regex?

CodePudding user response：

The problem seems to be that .str.extract returns a pd.DataFrame. You can call .squeeze() on it to turn it into a Series, and then the assignment works:

df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").squeeze()

Index alignment takes care of the rest: only the rows selected by the mask receive the extracted values.
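
To make the difference in return types concrete, here is a minimal sketch using the sample frame from the question. The variable names (pattern, as_frame, as_series, direct, mask) are only illustrative, and passing expand=False is an alternative to .squeeze() that was not part of the answer above; it should return a Series directly when the pattern has a single capture group.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["JUNK-0003426", "TEST-0003435", "JUNK-0003432", "TEST-0003433", "TEST-0003436"],
        "value": ["Junk", "None", "Junk", "None", "None"],
    }
)

pattern = r"TEST-\d{3}(\d+)"

# With the default expand=True, even a single capture group yields a DataFrame.
as_frame = df["name"].str.extract(pattern)

# squeeze() collapses the one-column DataFrame into a Series.
as_series = as_frame.squeeze()

# Alternative (an assumption, not from the answer): expand=False gives a Series directly.
direct = df["name"].str.extract(pattern, expand=False)

mask = df["name"].str.startswith("TEST")

# Assigning a Series aligns on the index, so only the masked rows are filled.
df.loc[mask, "value"] = direct

print(type(as_frame).__name__, type(as_series).__name__, type(direct).__name__)
print(df)
```

Rows whose name does not match the pattern come back as missing values from extract, but the mask keeps them out of the assignment, so the 'Junk' rows stay as they are.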

CodePudding user response：

Instead of trying to get the group, you can replace the rest with the empty string:

df.loc[df['value']=='None', 'value'] = df.loc[df['value']=='None', 'name'].str.replace(r'TEST-\d{3}', '', regex=True)
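
As a runnable sketch of this replace-based approach on the question's sample frame: the mask variable is only illustrative, and the code assumes every row whose value is 'None' has a name beginning with the TEST- prefix. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["JUNK-0003426", "TEST-0003435", "JUNK-0003432", "TEST-0003433", "TEST-0003436"],
        "value": ["Junk", "None", "Junk", "None", "None"],
    }
)

# Select the rows that still hold the placeholder string 'None'.
mask = df["value"] == "None"

# Strip the 'TEST-' prefix plus the next three digits; what remains is the wanted suffix.
df.loc[mask, "value"] = df.loc[mask, "name"].str.replace(r"TEST-\d{3}", "", regex=True)

print(df)
```

Because both sides of the assignment use the same mask, the right-hand Series has exactly the index labels being written to.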

CodePudding user response：

Here is a way to do it:

df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").loc[:,0]

Output:

           name value
0  JUNK-0003426  Junk
1  TEST-0003435  3435
2  JUNK-0003432  Junk
3  TEST-0003433  3433
4  TEST-0003436  3436