Pandas equivalent of SQL case when statement to create new column


I have this df:

df = pd.DataFrame.from_dict(
    {
        'Name': ['Jane', 'Melissa', 'John', 'Matt', 'Abernethy', 'Annie', 'Brook', 'Brian', 'Carrie'],
        'Tag': ['tag1,tag2', 'tag9,tag_wrong1', 'tag4,tag3,tag7', 'tag2,tag9', 'tag1,tag3', 'tag3,tag4,tag5', 'tag9,tag2', 'tag3,tag2', 'tag1,tag5'],
    }
)

which looks like this:

Name Tag
Jane tag1,tag2
Melissa tag9,tag_wrong1
John tag4,tag3,tag7
Matt tag2,tag9
Abernethy tag1,tag3
Annie tag3,tag4,tag5
Brook tag9,tag2
Brian tag3,tag2
Carrie tag1,tag5

My goal is to create a third column "Tag_after". The simple SQL case statement would be:

UPDATE table SET Tag_after =
CASE
    WHEN Tag LIKE '%tag1%'           THEN 'tag1'
    WHEN Tag LIKE '%tag2%'           THEN 'tag2'
    WHEN Tag LIKE '%tag3%'           THEN 'tag3'
    WHEN Tag LIKE '%tag4%'           THEN 'tag4'
    WHEN Tag LIKE '%tag5%'           THEN 'tag5'
    WHEN Tag LIKE '%tag_wrong1%'     THEN 'tag_right1'
    ELSE Tag
END

tag1 has a higher priority than tag2, and so on

tag_wrong1 will be changed to tag_right1

The desired output is this:

Name Tag Tag_after
Jane tag1,tag2 tag1
Melissa tag9,tag_wrong1 tag_right1
John tag4,tag3,tag7 tag3
Matt tag2,tag9 tag2
Abernethy tag1,tag3 tag1
Annie tag3,tag4,tag5 tag3
Brook tag9,tag2 tag2
Brian tag3,tag2 tag2
Carrie tag1,tag5 tag1

My (wrong) try1:

import pandas as pd

tag_1 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_wrong1', 'tag9']
tag_2 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_right1', 'tag9']


df['Tag_after'] = ''


def set_visitor_tag(df, tag_before, tag_after, col_tag, add_col_tag):
    i = 0
    while i < len(tag_before):
        df.loc[~df[col_tag].isnull() & df[col_tag].str.contains(tag_before[i]), [add_col_tag]] = tag_after[i]
        i = i + 1

set_visitor_tag(df, tag_1, tag_2, 'Tag', 'Tag_after')

The result of this processing does not follow the priority order I set.

I think the function performs matching and assignment repeatedly for each row of data. What I want is that once a row has been assigned a value, it is not processed again.
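In other words, the behavior I want is something like this sketch (shortened sample data; it assigns only where Tag_after is still empty, so each row is written at most once):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jane', 'Melissa'],
    'Tag': ['tag1,tag2', 'tag9,tag_wrong1'],
})
df['Tag_after'] = ''

tag_before = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_wrong1']
tag_after = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_right1']

# Walk the tags in priority order, assigning only where no
# higher-priority tag has already matched.
for before, after in zip(tag_before, tag_after):
    mask = (df['Tag_after'] == '') & df['Tag'].str.contains(before, regex=False)
    df.loc[mask, 'Tag_after'] = after

# Rows still empty fall back to the original Tag (SQL's ELSE branch).
df.loc[df['Tag_after'] == '', 'Tag_after'] = df['Tag']
print(df)
```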

My (wrong) try2:

def set_visitor_tag(df, tag_before, tag_after, col_tag, add_col_tag):
    i = 0
    while i < len(tag_before):
        if tag_before[i] in df[col_tag]:
            df.loc[df[col_tag].str.contains(tag_before[i]), [add_col_tag]] = tag_after[i]
        else:
            continue
        i = i + 1

Many thx.

CodePudding user response:

One option is to use the case_when function from pyjanitor, which is similar to SQL's case when:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
df.case_when(
    df.Tag.str.contains('tag1'), 'tag1',  # condition, result
    df.Tag.str.contains('tag2'), 'tag2',
    df.Tag.str.contains('tag3'), 'tag3',
    df.Tag.str.contains('tag4'), 'tag4',
    df.Tag.str.contains('tag5'), 'tag5',
    df.Tag.str.contains('tag_wrong1'), 'tag_right1',
    df.Tag,  # default if none of the conditions evaluate to True
    column_name='Tag_after')
Out[11]: 
        Name              Tag   Tag_after
0       Jane        tag1,tag2        tag1
1    Melissa  tag9,tag_wrong1  tag_right1
2       John   tag4,tag3,tag7        tag3
3       Matt        tag2,tag9        tag2
4  Abernethy        tag1,tag3        tag1
5      Annie   tag3,tag4,tag5        tag3
6      Brook        tag9,tag2        tag2
7      Brian        tag3,tag2        tag2
8     Carrie        tag1,tag5        tag1

Another option is to use numpy's select function:

import numpy as np

condlist = [df.Tag.str.contains('tag1'), df.Tag.str.contains('tag2'),
            df.Tag.str.contains('tag3'), df.Tag.str.contains('tag4'),
            df.Tag.str.contains('tag5'), df.Tag.str.contains('tag_wrong1') ]
choicelist = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_right1']
df.assign(Tag_after = np.select(condlist, choicelist, df.Tag))

        Name              Tag   Tag_after
0       Jane        tag1,tag2        tag1
1    Melissa  tag9,tag_wrong1  tag_right1
2       John   tag4,tag3,tag7        tag3
3       Matt        tag2,tag9        tag2
4  Abernethy        tag1,tag3        tag1
5      Annie   tag3,tag4,tag5        tag3
6      Brook        tag9,tag2        tag2
7      Brian        tag3,tag2        tag2
8     Carrie        tag1,tag5        tag1
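For reference, np.select checks the conditions top to bottom and keeps the first match per row, which is exactly CASE WHEN's priority semantics. A minimal standalone sketch (sample data is made up here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Tag': ['tag2,tag1', 'tag9,tag3']})

condlist = [df.Tag.str.contains('tag1'),
            df.Tag.str.contains('tag2'),
            df.Tag.str.contains('tag3')]
choicelist = ['tag1', 'tag2', 'tag3']

# The first True condition wins, so listing tag1 first gives it
# the highest priority, matching the CASE WHEN order above.
out = np.select(condlist, choicelist, default=df.Tag)
```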

CodePudding user response:

RegEx seems like the key here, since SQL's LIKE is essentially the same kind of pattern matching.

I wrote this on the assumption that if the tags contain none of tag1 ... tag5 or tag_wrong1, then "Tag_after" is supposed to be the whole "Tag" value. E.g. Matt's tags are "tag8,tag9", and since they match none of the given tag patterns, his Tag_after is the same as his Tag.

Input:

import pandas as pd
import re
df = pd.DataFrame.from_dict(
    {
        'Name': ['Jane', 'Melissa', 'John', 'Matt', 'Abernethy', 'Annie', 'Brook', 'Brian', 'Carrie'],
        'Tag': ['tag1,tag2', 'tag9,tag_wrong1', 'tag4,tag3,tag7', 'tag8,tag9', 'tag1,tag3', 'tag3,tag4', 'tag9,tag2', 'tag3,tag2', 'tag1,tag5'],
    }
)
df["Tag_after"] = ""

reg_list = ["^(.*?)tag1", "^(.*?)tag2", "^(.*?)tag3", "^(.*?)tag4", "^(.*?)tag5", "^(.*?)tag_wrong1"]
sub_list = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_right1']
for index, row in df.iterrows():
    for reg in reg_list:
        if re.match(reg, row["Tag"]):
            # write back with df.at: `row` is a copy, so assigning
            # to it would not modify the DataFrame
            df.at[index, "Tag_after"] = sub_list[reg_list.index(reg)]
            break
    if df.at[index, "Tag_after"] == "":
        df.at[index, "Tag_after"] = row["Tag"]

print(df)

Output:

        Name              Tag   Tag_after
0       Jane        tag1,tag2        tag1
1    Melissa  tag9,tag_wrong1  tag_right1
2       John   tag4,tag3,tag7        tag3
3       Matt        tag8,tag9   tag8,tag9
4  Abernethy        tag1,tag3        tag1
5      Annie        tag3,tag4        tag3
6      Brook        tag9,tag2        tag2
7      Brian        tag3,tag2        tag2
8     Carrie        tag1,tag5        tag1

CodePudding user response:

In pandas, a defined priority can be modeled with an ordered Categorical. First get all matched values with Series.str.extractall (joining the tag list into a single pattern with |), convert the matches to an ordered Categorical, aggregate with min to keep the highest-priority category per row, and finally map through a dictionary built from both lists:

tag_1 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_wrong1', 'tag9']
tag_2 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_right1', 'tag9']

d = dict(zip(tag_1, tag_2))

s = df['Tag'].str.extractall(f'({"|".join(tag_1)})')[0]
cats = pd.Categorical(s, categories=tag_1, ordered=True)
df['Tag_after'] = pd.Series(cats, index=s.index).groupby(level=0).min().map(d)
print(df)
        Name              Tag   Tag_after
0       Jane        tag1,tag2        tag1
1    Melissa  tag9,tag_wrong1  tag_right1
2       John   tag4,tag3,tag7        tag3
3       Matt        tag2,tag9        tag2
4  Abernethy        tag1,tag3        tag1
5      Annie   tag3,tag4,tag5        tag3
6      Brook        tag9,tag2        tag2
7      Brian        tag3,tag2        tag2
8     Carrie        tag1,tag5        tag1