Is there a way to iterate through a pandas column and remove certain characters?

Sample data:

/SomeText/2016-11-11
/SomeText/2016-11-11/13.40.48
/SomeText/15.T-06-26/00.00.00

I tried to compile a list of regex patterns, but it did not work, and I am not very experienced with regex. What I am trying to do is write a regex pattern that matches anything after:

/SomeText/

and remove it from the column and store the new results in a new pandas column.

CodePudding user response：

There are various options, but I believe a common approach is to use the column's apply() method (Series.apply).

You write a function, apply it to every value in the column, and store the result in a new column, for example:

import re

df["<new_column_name>"] = df["<column_name>"].apply(lambda x: re.sub(r"\/[a-zA-Z]*\/", "", x))

To explain what this is doing, it applies an anonymous (lambda) function...

lambda x: re.sub(r"\/[a-zA-Z]*\/", "", x)

to every x in the column that is supplied.

The re.sub call matches a forward slash (with "\/"), then any number of letters (with "[a-zA-Z]*"), and then another forward slash. It replaces anything that matches this with an empty string. In Python the backslash before "/" is optional, because "/" is not a special character in a regex.
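As a minimal sketch of the approach described above, here it is run on the sample data from the question. The column names "Text", "Cleaned" and "Cleaned_vec" are made up for illustration. The second assignment uses pandas' vectorized str.replace with the same pattern, which should give the same result without a Python-level lambda.

```python
import re

import pandas as pd

# Sample data from the question; the column name "Text" is an assumption.
df = pd.DataFrame({"Text": ["/SomeText/2016-11-11",
                            "/SomeText/2016-11-11/13.40.48",
                            "/SomeText/15.T-06-26/00.00.00"]})

# Compile once: a slash, any number of letters, then another slash.
pattern = re.compile(r"/[a-zA-Z]*/")

# Option 1: apply a function to every value and store it in a new column.
df["Cleaned"] = df["Text"].apply(lambda x: pattern.sub("", x))

# Option 2: the vectorized string method with the same pattern.
df["Cleaned_vec"] = df["Text"].str.replace(r"/[a-zA-Z]*/", "", regex=True)

print(df)
```

Note that this removes the "/SomeText/" prefix and keeps what follows it. The original "Text" column is left untouched.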

CodePudding user response：

The correct regex to match "/SomeText/" is \/SomeText\/. Note that I am assuming here that you do have a specific string you want to clean out -- rather than cleaning out all alphabetic characters, etc.

Let's say your data is stored in a column called "orig" of dataframe df. You can separate out "/SomeText/" and the text that follows like this:

df['orig'].str.extract(r"(\/SomeText\/)(.*)")
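To make the result concrete, here is a small sketch on the sample data from the question. With two capture groups, str.extract returns a DataFrame with one column per group. The names "prefix" and "rest" are my own labels, not something pandas assigns.

```python
import pandas as pd

# Sample data from the question, stored in a column called "orig".
df = pd.DataFrame({"orig": ["/SomeText/2016-11-11",
                            "/SomeText/2016-11-11/13.40.48",
                            "/SomeText/15.T-06-26/00.00.00"]})

# Two capture groups give a two-column DataFrame (columns 0 and 1).
parts = df["orig"].str.extract(r"(/SomeText/)(.*)")
parts.columns = ["prefix", "rest"]  # hypothetical, more readable names

# Keep whichever part is needed in a new column of the original frame.
df["rest"] = parts["rest"]

print(df)
```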

CodePudding user response：

Use str.extract:

df = pd.DataFrame({'Text': ['/SomeText/2016-11-11', 
                            '/SomeText/2016-11-11/13.40.48',
                            '/SomeText/15.T-06-26/00.00.00']})

df['Text'] = df['Text'].str.extract(r'(.*SomeText)/')
print(df)

# Output:
        Text
0  /SomeText
1  /SomeText
2  /SomeText
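Since the question asks for the result in a new column, a variation on the same idea could look like the sketch below. The column name "Prefix" is made up, and expand=False is used so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["/SomeText/2016-11-11",
                            "/SomeText/2016-11-11/13.40.48",
                            "/SomeText/15.T-06-26/00.00.00"]})

# Same pattern as above, but written to a new column so "Text" is preserved.
df["Prefix"] = df["Text"].str.extract(r"(.*SomeText)/", expand=False)

print(df)
```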