I have this df:
df = pd.DataFrame.from_dict(
{
'Name': ['Jane', 'Melissa', 'John', 'Matt', 'Abernethy', 'Annie', 'Brook', 'Brian', 'Carrie'],
'Tag': ['tag1,tag2', 'tag1', 'tag4,tag3,tag7', 'tag2,tag9', 'tag1,tag3', 'tag3,tag4', 'tag9,tag2', 'tag3,tag2', 'tag1,tag5'],
}
)
which looks like this:
Name | Tag |
---|---|
Jane | tag1,tag2 |
Melissa | tag9,tag_wrong1 |
John | tag4,tag3,tag7 |
Matt | tag2,tag9 |
Abernethy | tag1,tag3 |
Annie | tag3,tag4,tag5 |
Brook | tag9,tag2 |
Brian | tag3,tag2 |
Carrie | tag1,tag5 |
My goal is to create a third column "Tag_after". The simple SQL case statement would be:
UPDATE table SET Tag_after =
CASE
WHEN Tag LIKE '%tag1%' THEN 'tag1'
WHEN Tag LIKE '%tag2%' THEN 'tag2'
WHEN Tag LIKE '%tag3%' THEN 'tag3'
WHEN Tag LIKE '%tag4%' THEN 'tag4'
WHEN Tag LIKE '%tag5%' THEN 'tag5'
WHEN Tag LIKE '%tag_wrong1%' THEN 'tag_right1'
ELSE Tag
END
tag1 has a higher priority than tag2, and so on
tag_wrong1 will be changed to tag_right1
The desired output is this:
Name | Tag | Tag_after |
---|---|---|
Jane | tag1,tag2 | tag1 |
Melissa | tag9,tag_wrong1 | tag_right1 |
John | tag4,tag3,tag7 | tag3 |
Matt | tag2,tag9 | tag2 |
Abernethy | tag1,tag3 | tag1 |
Annie | tag3,tag4,tag5 | tag3 |
Brook | tag9,tag2 | tag2 |
Brian | tag3,tag7 | tag3 |
Carrie | tag1,tag5 | tag1 |
My (wrong) try1:
import pandas as pd
tag_1 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_wrong1', 'tag9']
tag_2 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_right1', 'tag9']
df['Tag_after'] = ''
def set_visitor_tag(df, tag_before, tag_after, col_tag, add_col_tag):
i = 0
while i < len(tag_before):
df.loc[~df[col_tag].isnull() & df[col_tag].str.contains(tag_before[i]), [add_col_tag]] = tag_after[i]
i = i 1
set_visitor_tag(df, tag_1, tag_2, 'Tag', 'Tag_after')
The result of this processing is not the same as the priority of the permissions I set.
I think the function does multiple matching and assignment operations for each row of data, What I want is that after processing a row once, it will not be processed anymore.
My (wrong) try2 :
def set_visitor_tag(df, tag_before, tag_after, col_tag, add_col_tag):
i = 0
while i < len(tag_before):
if tag_before[i] in df[col_tag]:
df.loc[df[col_tag].str.contains(tag_before[i]), [add_col_tag]] = tag_after[i]
else:
continue
i = i 1
Many thx.
CodePudding user response:
One option is to use the case_when function from pyjanitor, which is similar to SQL's case when:
# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
df.case_when(
df.Tag.str.contains('tag1'), 'tag1', # condition, result
df.Tag.str.contains('tag2'), 'tag2',
df.Tag.str.contains('tag3'), 'tag3',
df.Tag.str.contains('tag4'), 'tag4',
df.Tag.str.contains('tag5'), 'tag5',
df.Tag.str.contains('tag_wrong1'), 'tag_right1',
df.Tag, # default if none of the conditions evaluate to True
column_name = 'Tag_after')
Out[11]:
Name Tag Tag_after
0 Jane tag1,tag2 tag1
1 Melissa tag9,tag_wrong1 tag_right1
2 John tag4,tag3,tag7 tag3
3 Matt tag2,tag9 tag2
4 Abernethy tag1,tag3 tag1
5 Annie tag3,tag4,tag5 tag3
6 Brook tag9,tag2 tag2
7 Brian tag3,tag2 tag2
8 Carrie tag1,tag5 tag1
Another option is to use numpy's select function:
condlist = [df.Tag.str.contains('tag1'), df.Tag.str.contains('tag2'),
df.Tag.str.contains('tag3'), df.Tag.str.contains('tag4'),
df.Tag.str.contains('tag5'), df.Tag.str.contains('tag_wrong1') ]
choicelist = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_right1']
df.assign(Tag_after = np.select(condlist, choicelist, df.Tag))
Name Tag Tag_after
0 Jane tag1,tag2 tag1
1 Melissa tag9,tag_wrong1 tag_right1
2 John tag4,tag3,tag7 tag3
3 Matt tag2,tag9 tag2
4 Abernethy tag1,tag3 tag1
5 Annie tag3,tag4,tag5 tag3
6 Brook tag9,tag2 tag2
7 Brian tag3,tag2 tag2
8 Carrie tag1,tag5 tag1
CodePudding user response:
I'm sure RegEx is the key here, as it's a pattern matching process similar to SQL LIKE
, I think.
I wrote this on the assumption that if the tags don't have any of the "tag1 ... tag5 or tag_wrong1", then "Tag_after" is supposed to be the whole "Tag" value. E.g Matt's row tags are "tag8,tag9" and since it doesn't match any of the given tag patterns, Matt gets the tag_after value the same as his tags.
Input:
import pandas as pd
import re
df = pd.DataFrame.from_dict(
{
'Name': ['Jane', 'Melissa', 'John', 'Matt', 'Abernethy', 'Annie', 'Brook', 'Brian', 'Carrie'],
'Tag': ['tag1,tag2', 'tag9,tag_wrong1', 'tag4,tag3,tag7', 'tag8,tag9', 'tag1,tag3', 'tag3,tag4', 'tag9,tag2', 'tag3,tag2', 'tag1,tag5'],
}
)
df["Tag_after"] = ""
reg_list = ["^(.*?)tag1", "^(.*?)tag2", "^(.*?)tag3", "^(.*?)tag4", "^(.*?)tag5", "^(.*?)tag_wrong1"]
sub_list = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_right1']
for index, row in df.iterrows():
for reg in reg_list:
if re.match(reg, row["Tag"]):
row["Tag_after"] = sub_list[reg_list.index(reg)]
break
if row["Tag_after"] == "":
row["Tag_after"] = row["Tag"]
print(df)
Output:
Name Tag Tag_after
0 Jane tag1,tag2 tag1
1 Melissa tag9,tag_wrong1 tag_right1
2 John tag4,tag3,tag7 tag3
3 Matt tag8,tag9 tag8,tag9
4 Abernethy tag1,tag3 tag1
5 Annie tag3,tag4 tag3
6 Brook tag9,tag2 tag2
7 Brian tag3,tag2 tag2
8 Carrie tag1,tag5 tag1
CodePudding user response:
In pandas for defined priority is used ordered Categorical
, so first get all matched values bySeries.str.extractall
with join values in list by |
, convert to Categorical
, aggregate by minimal for category with lowest priority and last mapping by dictionary created by both lists:
tag_1 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_wrong1', 'tag9']
tag_2 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_right1', 'tag9']
d = dict(zip(tag_1, tag_2))
s = df['Tag'].str.extractall(f'({"|".join(tag_1)})')[0]
cats = pd.Categorical(s, categories=tag_1, ordered=True)
df['Tag_after'] = pd.Series(cats, index=s.index).groupby(level=0).min().map(d)
print (df)
Name Tag Tag_after
0 Jane tag1,tag2 tag1
1 Melissa tag9,tag_wrong1 tag_right1
2 John tag4,tag3,tag7 tag3
3 Matt tag2,tag9 tag2
4 Abernethy tag1,tag3 tag1
5 Annie tag3,tag4 tag3
6 Brook tag9,tag2 tag2
7 Brian tag3,tag2 tag2
8 Carrie tag1,tag5 tag1