Pandas df how to parse column value to extract string to int with regex-CodePudding

I have a Pandas df with 2 columns:

    name       Count_Relationship
0   allicin    DOWNREGULATE: 1
1   allicin    DOWNREGULATE: 2
2   allicin    UPREGULATE: 1 | DOWNREGULATE: 1
3   aspirin    UPREGULATE: 5 | DOWNREGULATE: 1
4   albuterol  DOWNREGULATE: 1
5   albuterol  UPREGULATE: 3

I would like to filter out only the rows where if I group by the 'names' and count in the 'Count_Relationship' column the amount of DOWNREGULATE to be more than the amount of the UPREGULATE. In this case allicin would have DOWREGULATE 1 2 1=4 and UPREGULATE =1 so num_downregulate>num_upregulate, while in the rest(aspirin,albuterol) that's not the case. I would like to return this filtered df:

    name      Count_Relationship
0   allicin   DOWNREGULATE: 1
1   allicin   DOWNREGULATE: 2
2   allicin   UPREGULATE: 1 | DOWNREGULATE: 1

The column Count_Relationship is a string, so I would have to parse the number part of the string and convert it to int.

I tried this:

    import pandas as pd

    data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
    'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2', 'UPREGULATE: 1 | DOWNREGULATE: 1', 'UPREGULATE: 5 | DOWNREGULATE: 1', 'DOWNREGULATE: 1' , 'UPREGULATE: 3']
    }

    df = pd.DataFrame(data)

    substances = df["name"].tolist()
    substances = list(set(substances)) # to get the unique names

    result_substances = []
    
    for substance in (substances):
        try:
            numberOfdownregulate = df[(df["name"] == substance) & (\
            (df["Count_Relationship"].str.match(pat = '("DOWNREGULATE:"([0-9]))')).values[0].astype(int)        
        except:
            pass
        try:    
            numberOfupregulate = df[(df["name"] == substance) & (\
            (df["Count_Relationship"].str.match(pat = '("UPREGULATE:"([0-9]))')).values[0].astype(int)
        except:
            pass
    
        result = numberOfdownregulate - numberOfupregulate
        
        if result > 0:
            result_substances.append(substance)


    df_filtered = df[df["name"].isin(result_substances)]

but I get a syntax error at the line numberOfdownregulate where my regex is. How can fix the algorithm? thanks so much

CodePudding user response：

You could extract the infos, compare the up and down, and build a mask to select the data:

drugs = (df.join(df['Count_Relationship'].str.extractall('(?P<down>(?<=DOWNREGULATE: )\d )|(?P<up>(?<=UPREGULATE: )\d )')
                   .groupby(level=0).first().fillna(0).astype(int)
                 )
           .groupby('name').agg({'down': 'sum', 'up': 'sum'})
           .query('down >= up')
           .index
        )

df[df['name'].isin(drugs)]

output:

      name               Count_Relationship
0  allicin                  DOWNREGULATE: 1
1  allicin                  DOWNREGULATE: 2
2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1

CodePudding user response：

I suggest extracting the DOWNREGULATE and UPREGULATE values into different columns, then apply the sum of the values grouped by name and check which is bigger.

The example below creates an additional boolean column named UP_gt_DOWN, literally UPREGULATE greater than DOWNREGULATE:

df['UPREGULATE'] = df['Count_Relationship'].str.extract(r"UPREGULATE: (\d*)").fillna(0).astype(int)
df['DOWNREGULATE'] = df['Count_Relationship'].str.extract(r"DOWNREGULATE: (\d*)").fillna(0).astype(int)

summed_df = df.groupby('name').sum()
summed_df['UP_gt_DOWN'] = summed_df['UPREGULATE'] > summed_df['DOWNREGULATE']
print(summed_df)

# Output
#            UPREGULATE  DOWNREGULATE  UP_gt_DOWN
# name                                           
# albuterol           3             1        True
# allicin             1             4       False
# aspirin             5             1        True

filtered_drugs = summed_df[~summed_df['UP_gt_DOWN']].index
print(df[df['name'].isin(filtered_drugs)])

# Output
#       name               Count_Relationship  UPREGULATE  DOWNREGULATE
# 0  allicin                  DOWNREGULATE: 1           0             1
# 1  allicin                  DOWNREGULATE: 2           0             2
# 2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1           1             1