I am new to programming, and especially to regex. I have run into a problem mapping dictionary items onto a pandas DataFrame column.
A minimal reproducible example is as follows (my original dataset is much larger).
My CSV file looks like this:
id | color | status |
---|---|---|
1 | red | "this is equal to the / number 3" |
2 | yellow | you should visit the url \n http:13/color/findings/7 |
67 | green | conver it to a new value |
7 | blue | "this is equal to the / number 13" |
8 | green | conver it to a new value |
23 | white | you should visit the url \n http:13/color/findings/67 |
The result I would like is to make the status of each row more generic:
id | color | status |
---|---|---|
1 | red | "this is equal to a number" |
2 | yellow | you should visit the corresponding website |
67 | green | conver it to a new value |
7 | blue | "this is equal to a number" |
8 | green | conver it to a new value |
23 | white | you should visit the corresponding website |
The method I would like to use is to create a dictionary whose keys are patterns for the status comments and whose values are the corresponding replacements, and then replace them:
my_dict = {
    '"this is equal to the \/ number \d+"' : '"this is equal to a number"',
    'you should visit the url \\n http:\d+\/color\/findings\/\d+' : 'you should visit the corresponding website',
    'conver it to a new value' : 'conver it to a new value'
}
For the first method, I tried to replace them by mapping:
df['status'] = [next((v for k,v in my_dict.items() if k in x), float('nan')) for x in df['status'].tolist()]
This only gives me the status that is identical to its key: "conver it to a new value".
I also tried:
dictkeys_pattern = re.compile('|'.join(my_dict), re.IGNORECASE)
status_found = df['status'].str.findall(my_dict)
stat = []
for i in status_found:
    for k, v in my_dict.items():
        if re.match(k, i, re.IGNORECASE):
            stat.append(v)
        else:
            stat = None
if status_found:
    stat = []
    for i in status_found:
        for k, v in my_dict.items():
            if re.match(k, i, re.IGNORECASE):
                stat.append(v)
            else:
                stat = None
However, status_found is an empty Series.
Could someone help me and show me which part I am doing wrong?
CodePudding user response:
Try this ....
import pandas as pd

dic = {'id': [1, 2, 3, 4, 5, 6],
       'color': ['red', 'yellow', 'green', 'blue', 'green', 'white'],
       'status': ['this is equal to the / number 3', 'you should visit the url \n http:13/color/findings/7', 'conver it to a new value',
                  'this is equal to the / number 13', ' conver it to a new value', 'you should visit the url \n http:13/color/findings/6']}
df = pd.DataFrame(dic)
print(df)

# Create a new list (the new status), one generic value per row
status1 = ['this is equal to a number', 'you should visit the corresponding website', 'conver it to a new value',
           'this is equal to a number', 'conver it to a new value', 'you should visit the corresponding website']
df['status'] = status1  # keep the column name, replace its contents with the new list
print('\n', df)
CodePudding user response:
I don't know how many different cases you would like to modify, but you could simply write a function that replaces the status whenever a regex pattern is found. You can then create a new column by passing the cleaner function to .apply(), like this:
import pandas as pd
import re

# Create sample data
d = {'color': ['red', 'yellow', 'green'],
     'status': ['"this is equal to the / number 3"', 'you should visit the url \n http:13/color/findings/7',
                'conver it to a new value']}
df = pd.DataFrame(d)

# Define cleaner function
def cleaner(x):
    # Replace the whole status if it contains " number" followed by digits
    new_x = re.sub(r'.*( number\s?\d+).*', '"this is equal to a number"', x)
    return new_x

# Create new column with cleaner function
df['status_cleaned'] = df['status'].apply(lambda x: cleaner(x))
print(df)
Output:
 | status | status_cleaned |
---|---|---|
0 | "this is equal to the / number 3" | "this is equal to a number" |
1 | you should visit the url \n http:13/color/find... | you should visit the url \n http:13/color/find... |
2 | conver it to a new value | conver it to a new value |
Simply add further re.sub() lines to the cleaner function for the other cases you would like to modify.
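As a rough sketch of how the cleaner function might grow to cover several cases, the patterns can be kept in one table and tried in order. The names PATTERNS and COMPILED and the two regexes below are my own assumptions, written against the sample rows in the question; they are not taken from the original post and would need adjusting for the real data.

```python
import re
import pandas as pd

# Hypothetical pattern table (an assumption for this sketch): regex -> generic status.
PATTERNS = {
    r'this is equal to the / number \d+': '"this is equal to a number"',
    # \s+ covers the space, newline, space between "url" and "http"
    r'you should visit the url\s+http:\d+/color/findings/\d+': 'you should visit the corresponding website',
}

# Compile once so the patterns are not re-parsed for every row.
COMPILED = [(re.compile(p, re.IGNORECASE), repl) for p, repl in PATTERNS.items()]

def cleaner(x):
    # Return the generic status of the first pattern found in x,
    # or x unchanged if no pattern matches.
    for pattern, repl in COMPILED:
        if pattern.search(x):
            return repl
    return x

df = pd.DataFrame({
    'color': ['red', 'yellow', 'green'],
    'status': ['"this is equal to the / number 3"',
               'you should visit the url \n http:13/color/findings/7',
               'conver it to a new value'],
})
df['status_cleaned'] = df['status'].apply(cleaner)
print(df['status_cleaned'].tolist())
```

Rows that match none of the patterns are passed through as they are, so the identity entry from the question's dictionary is not needed here.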
CodePudding user response:
You can use str.replace
dataf["status"] = (
    dataf["status"]
    .str.replace(r"http\:\d+/color/findings/\d+", "website", regex=True)
    .str.replace(r"\d+|/", "", regex=True)
)
First we replace the URLs with "website", then we remove the remaining digits and forward slashes.
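If the goal is to keep the dictionary-driven approach from the question, one possible variation is to pass the dictionary to Series.replace with regex=True, which treats each key as a regular expression. The sketch below assumes the sample rows from the question; the replacements dictionary and its patterns are illustrative, and each pattern is anchored so that the whole status is swapped rather than only part of it.

```python
import pandas as pd

dataf = pd.DataFrame({
    "status": [
        '"this is equal to the / number 3"',
        "you should visit the url \n http:13/color/findings/7",
        "conver it to a new value",
    ]
})

# Hypothetical mapping (an assumption for this sketch): regex -> generic status.
# ^ and $ anchor each pattern to the full string, so the full status is replaced.
replacements = {
    r"^.*number \d+.*$": '"this is equal to a number"',
    r"^you should visit the url\s+http:\d+/color/findings/\d+$": "you should visit the corresponding website",
}

# With regex=True the dictionary keys are interpreted as regular expressions.
dataf["status"] = dataf["status"].replace(replacements, regex=True)
print(dataf["status"].tolist())
```

Statuses that match none of the keys are left untouched.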