Python Categorize Dataframe Column Conditionally Using Regular Expression

Time:11-18

I have a dataframe:

group   id
A   009x
A   010x
B   009x
B   002x
C   002x
C   003x

How do I make a new column, new, that categorizes each row based on its group, under the following three conditions:

  1. If all id values in the group are either 009x or 010x, categorize as g1
  2. If at least one id value in the group is 009x or 010x AND at least one other id value is not, categorize as g2
  3. Otherwise, keep the id value as is

Desired result:

group   id  new
A   009x    g1
A   010x    g1
B   009x    g2
B   002x    g2
C   002x    002x
C   003x    003x

import pandas as pd

data = {
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'id': ['009x', '010x', '009x', '002x', '002x', '003x'],
}
df = pd.DataFrame(data)
df

Answer:

I hope I've understood your question correctly. You can use .groupby() with a custom function:

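def categorize_fn(x):
    # x is the sub-dataframe for a single group
    # True for each row whose id is 009x or 010x
    tmp = x["id"].isin(["009x", "010x"])

    if tmp.all():
        # every id in the group is 009x or 010x
        x["new"] = "g1"
    elif tmp.any():
        # some, but not all, ids in the group are 009x or 010x
        x["new"] = "g2"
    else:
        # no id matches: keep the original id
        x["new"] = x["id"]

    return x


# group_keys=False keeps the original row index instead of adding the group key to it
df = df.groupby("group", group_keys=False).apply(categorize_fn)
print(df)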

Prints:

  group    id   new
0     A  009x    g1
1     A  010x    g1
2     B  009x    g2
3     B  002x    g2
4     C  002x  002x
5     C  003x  003x
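
Since the title asks about regular expressions, here is a rough sketch of an alternative that avoids .apply(). It assumes the same df as in the question. The names is_target, all_target and any_target are just illustrative, and the pattern 009x|010x is my reading of the conditions. It builds the mask with .str.fullmatch() (for this data, .isin() would give the same result), broadcasts the per-group all/any back to every row with .transform(), and then overwrites id with .mask(). It may also sidestep the warning that recent pandas versions raise when .groupby().apply() operates on the grouping column.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "id": ["009x", "010x", "009x", "002x", "002x", "003x"],
})

# True where the whole id matches 009x or 010x (assumed pattern)
is_target = df["id"].str.fullmatch(r"009x|010x")

# Per-group all/any, broadcast back to one value per row
by_group = is_target.groupby(df["group"])
all_target = by_group.transform("all")
any_target = by_group.transform("any")

# Start from id, mark mixed groups as g2, then let g1 override
# for groups where every id matches (all implies any)
df["new"] = df["id"].mask(any_target, "g2").mask(all_target, "g1")
print(df)
```

The order of the two .mask() calls matters: a group where every id matches also satisfies the "any" test, so g1 has to be applied last.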