I'm trying to use the re library to use regular expression to substitute text. However, even after I use astype(str), the column that I want to change is still being stored as an object.
import os
import re
import pandas as pd
#https://www.geeksforgeeks.org/update-column-value-of-csv-in-python/
# reading the csv file
df = pd.read_excel('Upload to dashboard - Untitled (1).xlsx', skiprows = 7)
df.drop(df.tail(7).index,inplace=True) # drop last n rows
print(df.head(3))
df['Publisher URL'] = df['Publisher URL'].fillna('')
df['Publisher URL'] = df['Publisher URL'].astype(str)
df['Publisher URL'] = df['Publisher URL'].str.replace('gerd#','')
print(df.dtypes)
df['Publisher URL2'] = df['Publisher URL'].str.replace('www.','')
trim = re.sub('(.*)(?:\bm\.)(.*)|(.*)','',df['Publisher URL'])
#https://docs.python.org/3/library/re.html#re.sub
print(df['trim'])
df.to_csv("C:/out.csv")
this is my output for print(df.dtypes)
[3 rows x 8 columns]
Unnamed: 0 object
Buying Agency object
Advertiser object
Publisher URL object
Date object
Buy Type object
Ad Type object
No. of Ads float64
And my error is
File ~\.spyder-py3\temp.py:30 in <module>
trim = re.sub('(.*)(?:\bm\.)(.*)|(.*)','',df['Publisher URL'])
File ~\Anaconda3\lib\re.py:210 in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
How can I change my script so I can use the re.sub function?
The regex is correct, I had some help from this question:Regex Exclusions
CodePudding user response:
You can use
df['trim'] = df['Publisher URL'].str.replace(r'^(.*?)\bm\.', r'\1', regex=True)
Or, a better idea is to use
df['trim'] = df['Publisher URL'].str.replace(r'^(.*?[/.])m\.', r'\1', regex=True)
See regex demo #1 and regex demo #2
Details:
^
- start of string(.*?[/.])
- Group 1 (in Python,\1
, not$1
, refers to this group pattern match): any chars other than line break chars, as few as possible, and then a/
or.
m\.
-m.
substring.