Evaluating an Expression using data frames-CodePudding

I have a df

Patient  ID
A        72        
A        SD75
A        74
A        74
B        71
C        72

And I have an expression

    exp = '((71 72)*((73 75) SD75))*((74 76) SD76))'

Now I need to evaluate this expression with ones and zeros if there's a match in the df for each of the three patients A, B , C .
A has a match with ID 72, SD75, 74 so the expressions should be

    A- '((0 1)*((0 0) 1))*((1 0) 0))'
    B- '((1 0)*((0 0) 0))*((0 0) 0))'
    C- '((0 1)*((0 0) 0))*((0 0) 0))'

And My final df_output should look like this

Patient  FinalVal
A        1
B        0
C        0

The FinalVal can be obtained by eval(exp) after replacing the ID's with 1's and O's

so Far here is where I reached. When I am replacing the ID 75 with 0 the SD75 is becoming SD0 and that's where I am stuck

import pandas as pd
import re
exp = '((71 72)*((73 75) SD75))*((74 76) SD76))'
mylist = re.sub(r'[^\w]', ' ', exp).split()
distinct_pt = df.Patient.drop_duplicates().dropna()
df_output = pd.DataFrame(distinct_pt)
df_output['Exp'] = exp
for i in distinct_pt:
    new_df = df[df.Patient == i]
    colt = new_df['ID'].tolist()
    for j in mylist:
        if j in colt:
            df_output.Exp = df_output.Exp.replace(j, '1')
        else:
            df_output.Exp = df_output.Exp.replace(j, '0')

CodePudding user response：

One way that would work for this exact scenario is to sort mylist in descending order so the items preceded by SD are iterated before the others.

mylist = re.sub(r'[^\w]', ' ', exp).split()   
mylist.sort(reverse=True)

CodePudding user response：

We can try creating an indicator DataFrame using a Series.get_dummies to create indicator columns for each value in the ID column, then reduce to a single row per Patient via groupby max:

# Convert to ID columns to binary indicators
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()

indicator_df:

         71  72  74  SD75
Patient                  
A         0   1   1     1
B         1   0   0     0
C         0   1   0     0

Now we can reindex from the expression terms to create missing columns. np.unique is used to ensure that duplicate terms in the expression do not result in duplicate columns in indicator_df (this can be omitted if it is guaranteed there are no duplicate terms):

exp = '(((71 72)*((73 75) SD75))*((74 76) SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Convert to ID columns to binary indicators
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
# Ensure All expression terms are present
indicator_df = indicator_df.reindex(
    columns=np.unique(cols),  # prevent duplicate cols
    fill_value=0  # Added cols are filled with 0
)

indicator_df:

         71  72  73  74  75  76  SD75  SD76
Patient                                    
A         0   1   0   1   0   0     1     0
B         1   0   0   0   0   0     0     0
C         0   1   0   0   0   0     0     0

Now if we alter the exp slightly by surrounding these new columns names with backticks (`) we can use DataFrame.eval to compute the expression:

exp = '(((71 72)*((73 75) SD75))*((74 76) SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# create indicator_df (chained)
indicator_df = (
    df.set_index('Patient')['ID']
        .str.get_dummies()
        .groupby(level=0).max()
        .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add Backticks around columns names
    re.sub(r'(\w )', r'`\1`', exp)
).reset_index(name='FinalVal')

result:

  Patient  FinalVal
0       A         1
1       B         0
2       C         0

The backticks are necessary to indicate these values represent column names, and not numeric values:

re.sub(r'(\w )', r'`\1`', exp)

# (((`71` `72`)*((`73` `75`) `SD75`))*((`74` `76`) `SD76`))

Notice the difference between 71 with backticks vs without:

# Column '71'   the number 71
pd.DataFrame({'71': [1, 2, 3]}).eval('B = `71`   71')

   71   B
0   1  72
1   2  73
2   3  74

Alternatively, the indicator_df can be created with a pivot_table after adding an indicator column with assign:

exp = '(((71 72)*((73 75) SD75))*((74 76) SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
indicator_df = (
    df.assign(
        indicator=1  # Assign Indicator Value of 1
    ).pivot_table(
        index='Patient',
        columns='ID',
        values='indicator',
        aggfunc='first',  # Handle Multiple 1s accept any
        fill_value=0  # Missing gets filled with 0
    ).reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add Backticks around columns names
    re.sub(r'(\w )', r'`\1`', exp)
).reset_index(name='FinalVal')

Setup and imports used:

import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})