I have a data frame with one column,
DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})
I want to add another column containing a substring of files. The final dataframe should look like
DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"], 'stain': ["PAS", "HE1", "HE1"]})
I tried
DF["Stain"] = DF.apply(lambda row: row.files[re.search(r'[a-zA-Z]{2,}', row.files).start():], axis=1)
But it raised
AttributeError: 'NoneType' object has no attribute 'start'
What should I do?
CodePudding user response:
If you want to extract the last 3 characters from the files column, you can do:
DF["stain"] = DF["files"].str[-3:]
print(DF)
Prints:
           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1
EDIT: Using a regular expression to extract the stain:
DF["stain"] = DF["files"].str.extract(r"^(?:.{2,})-\d*(.+)", expand=False)
print(DF)
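The AttributeError in the question comes from re.search returning None for a row that does not match, and str.extract avoids it by putting NaN in such rows instead. A minimal sketch of that behaviour follows; the fourth file name is a hypothetical row added only for illustration, and the slightly stricter pattern (which requires a non-digit character after the digits) is an assumption about what counts as a stain.

```python
import pandas as pd

# "S18-000999" is a hypothetical file name without a stain suffix.
DF = pd.DataFrame(
    {"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1", "S18-000999"]}
)

# Capture everything after the digits that follow the hyphen.
# The capture must start with a non-digit, so a name ending in digits does not match.
DF["stain"] = DF["files"].str.extract(r"^.{2,}-\d+(\D.*)$", expand=False)

# Rows without a match hold NaN instead of raising an exception.
missing = DF["stain"].isna()
print(DF)
print(DF.loc[missing, "files"].tolist())
```

Listing the unmatched rows this way is also a quick check for which file names triggered the original error.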
CodePudding user response:
Here's one approach using the str accessor:
DF[["files", "stain"]] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")
        files stain
0  S18-000344   PAS
1  S18-001850   HE1
2   S18-00344   HE1
If you need to keep the original files column unchanged, you can assign only the second capture group:
DF["stain"] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")[1]
           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1
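If you would rather keep the re.search approach from the question, the match has to be checked before it is used. This is only a sketch: stain_of is a hypothetical helper name, the pattern (a letter directly preceded by a digit, through the end of the name) is an assumption about the file naming, and the last file name is made up to show the no-match case.

```python
import re
import pandas as pd

# "S18-000999" is a hypothetical file name without a stain suffix.
DF = pd.DataFrame(
    {"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1", "S18-000999"]}
)

def stain_of(name):
    """Hypothetical helper: return the stain suffix, or None if there is none."""
    # First letter that directly follows a digit, through the end of the name.
    match = re.search(r"(?<=\d)[A-Za-z].*$", name)
    # re.search returns None when nothing matches, so guard before using it.
    return match.group(0) if match else None

DF["stain"] = DF["files"].map(stain_of)
print(DF)
```

Mapping over the single column also avoids the row-wise DF.apply(..., axis=1), which is not needed when only one column is read.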