Map pandas series with string pattern

Hi, I would like to map a pandas Series using a string pattern:

import pandas as pd

s = pd.DataFrame([['AMcU8', 10], ['AM8v', 15], ['ASw9', 14], ['ASw7', 14]], columns=['Code', 'Quantity'])

s["newcode"] = s["Code"].map({"AM.*8.*": "AM8", "AS.*9.*": "AS9"})

But I get this:

   Code  Quantity newcode
0  AMcU8       10     NaN
1  AM8v        15     NaN
2  ASw9        14     NaN
3  ASw7        14     NaN

instead of:

   Code  Quantity newcode
0  AMcU8       10     AM8
1  AM8v        15     AM8
2  ASw9        14     AS9
3  ASw7        14     NaN

Any idea? It's fine to get NaN when no pattern matches.

CodePudding user response：

You can use Series.replace with the regex parameter set to your mapping dictionary:

s["newcode"] = s["Code"].replace(regex={"AM.*8.*":"AM8", "AS.*9.*": "AS9"})

which produces:

    Code    Quantity    newcode
0   AMcU8   10          AM8
1   AM8v    15          AM8
2   ASw9    14          AS9
3   ASw7    14          ASw7

Note that values matching none of the patterns are left unchanged (ASw7 stays ASw7 rather than becoming NaN).
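If NaN is preferred for the rows that match nothing, as in the question, one possible sketch is to mask the result afterwards. The names mapping and matched are only illustrative. It also assumes every pattern is meant to match from the start of the code, because str.match anchors at the start while replace searches anywhere in the string:

```python
import pandas as pd

s = pd.DataFrame([['AMcU8', 10], ['AM8v', 15], ['ASw9', 14], ['ASw7', 14]],
                 columns=['Code', 'Quantity'])
mapping = {"AM.*8.*": "AM8", "AS.*9.*": "AS9"}

# True where at least one pattern matches the code from its start
matched = s["Code"].str.match("|".join(f"(?:{p})" for p in mapping))

# Replace as before, then set rows where nothing matched to NaN
s["newcode"] = s["Code"].replace(regex=mapping).where(matched)
```

With the sample data, only the ASw7 row is masked.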

CodePudding user response：

To my knowledge there is no direct function to perform this operation.

You can do this with apply() and the re module, iterating through your mapping dictionary as follows:

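import re

mapping = {"AM.*8": "AM8", "AS.*9": "AS9"}

def regex_mapping(x):
    # Return the replacement for the first pattern that matches at the start of x.
    # Returning v directly (rather than re.sub) also drops trailing characters, e.g. 'AM8v' -> 'AM8'.
    for k, v in mapping.items():
        if re.match(k, x):
            return v
    return x  # no pattern matched: keep the original value

s['Code'].apply(regex_mapping)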

Output:

0     AM8
1     AM8
2     AS9
3    ASw7
Name: Code, dtype: object
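The patterns above drop the trailing .* used in the question. That appears to work because re.match only anchors at the start of the string, so anything after the matched part is allowed. A small check of that behaviour (the variable names are just for illustration):

```python
import re

pattern = "AM.*8"

# re.match anchors only at the start, so trailing characters are allowed
starts_ok = re.match(pattern, "AM8v") is not None
# re.fullmatch requires the whole string to match the pattern
full_ok = re.fullmatch(pattern, "AM8v") is not None
# re.match does not look for a match beginning later in the string
late_ok = re.match(pattern, "xAM8") is not None

print(starts_ok, full_ok, late_ok)  # True False False
```

If codes that merely contain the pattern somewhere in the middle should also count, re.search would be the call to use instead.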

CodePudding user response：

As far as I know, you can't provide regex keys to Series.map().

However, this does what you need:

import re
import pandas as pd

s = pd.DataFrame([['AMcU8', 10], ['AM8', 15], ['ASw9', 14], ['ASw7', 14]], columns=['Code', 'Quantity'])


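def regex_replace(x, patterns: dict):
    # Return the replacement for the first regex that matches at the start of x.
    for regex, replacement in patterns.items():
        if re.match(regex, x):
            return replacement
    return x  # no regex matched: keep the original value


s["newcode"] = s["Code"].apply(regex_replace, patterns={"AM.*8": "AM8", "AS.*9": "AS9"})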

Or, if you apply this to large DataFrames frequently, you can compile the patterns once up front, which should be a bit faster:

import re
import pandas as pd
from functools import partial

s = pd.DataFrame([['AMcU8', 10], ['AM8', 15], ['ASw9', 14], ['ASw7', 14]], columns=['Code', 'Quantity'])


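def regex_replace(patterns: dict, x):
    # patterns maps precompiled regex objects to their replacement strings.
    for regex, replacement in patterns.items():
        if regex.match(x):
            return replacement
    return x  # no regex matched: keep the original value


# Compile each pattern once and bind the dictionary as the first argument.
mapping = partial(regex_replace, {re.compile("AM.*8"): "AM8", re.compile("AS.*9"): "AS9"})
s["newcode"] = s["Code"].apply(mapping)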
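Since the question says NaN is acceptable for codes that match nothing, a variant of the same idea is to return NaN from the function instead of the original value. Series.map accepts a callable as well as a dictionary, so the original map call can be kept. This is only a sketch using the question's data; the lookup function and the patterns name are hypothetical:

```python
import re

import numpy as np
import pandas as pd

s = pd.DataFrame([['AMcU8', 10], ['AM8v', 15], ['ASw9', 14], ['ASw7', 14]],
                 columns=['Code', 'Quantity'])

# Precompiled patterns mapped to their replacement codes
patterns = {re.compile("AM.*8"): "AM8", re.compile("AS.*9"): "AS9"}


def lookup(code):
    # Return the replacement for the first pattern matching at the start of code
    for regex, replacement in patterns.items():
        if regex.match(code):
            return replacement
    return np.nan  # nothing matched


s["newcode"] = s["Code"].map(lookup)
```

This gives AM8, AM8, AS9 and NaN for the four rows, which is the output asked for in the question.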