I have a df
Patient ID A 72 A SD75 A 74 A 74 B 71 C 72
And I have an expression
exp = '((71 72)*((73 75) SD75))*((74 76) SD76))'
Now I need to evaluate this expression with ones and zeros if there's a match in the df for each of the three patients A, B , C .
A has a match with ID 72, SD75, 74 so the expressions should be
A- '((0 1)*((0 0) 1))*((1 0) 0))' B- '((1 0)*((0 0) 0))*((0 0) 0))' C- '((0 1)*((0 0) 0))*((0 0) 0))'
And My final df_output should look like this
Patient FinalVal A 1 B 0 C 0The FinalVal can be obtained by eval(exp) after replacing the ID's with 1's and O's
so Far here is where I reached. When I am replacing the ID 75 with 0 the SD75 is becoming SD0 and that's where I am stuck
import pandas as pd
import re
exp = '((71 72)*((73 75) SD75))*((74 76) SD76))'
mylist = re.sub(r'[^\w]', ' ', exp).split()
distinct_pt = df.Patient.drop_duplicates().dropna()
df_output = pd.DataFrame(distinct_pt)
df_output['Exp'] = exp
for i in distinct_pt:
new_df = df[df.Patient == i]
colt = new_df['ID'].tolist()
for j in mylist:
if j in colt:
df_output.Exp = df_output.Exp.replace(j, '1')
else:
df_output.Exp = df_output.Exp.replace(j, '0')
CodePudding user response:
One way that would work for this exact scenario is to sort mylist in descending order so the items preceded by SD are iterated before the others.
mylist = re.sub(r'[^\w]', ' ', exp).split()
mylist.sort(reverse=True)
CodePudding user response:
We can try creating an indicator DataFrame using a Series.get_dummies
to create indicator columns for each value in the ID
column, then reduce to a single row per Patient
via groupby max
:
# Convert to ID columns to binary indicators
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
indicator_df
:
71 72 74 SD75
Patient
A 0 1 1 1
B 1 0 0 0
C 0 1 0 0
Now we can reindex
from the expression terms to create missing columns. np.unique
is used to ensure that duplicate terms in the expression do not result in duplicate columns in indicator_df
(this can be omitted if it is guaranteed there are no duplicate terms):
exp = '(((71 72)*((73 75) SD75))*((74 76) SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Convert to ID columns to binary indicators
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
# Ensure All expression terms are present
indicator_df = indicator_df.reindex(
columns=np.unique(cols), # prevent duplicate cols
fill_value=0 # Added cols are filled with 0
)
indicator_df
:
71 72 73 74 75 76 SD75 SD76
Patient
A 0 1 0 1 0 0 1 0
B 1 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0
Now if we alter the exp
slightly by surrounding these new columns names with backticks (`) we can use DataFrame.eval
to compute the expression:
exp = '(((71 72)*((73 75) SD75))*((74 76) SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# create indicator_df (chained)
indicator_df = (
df.set_index('Patient')['ID']
.str.get_dummies()
.groupby(level=0).max()
.reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
# Add Backticks around columns names
re.sub(r'(\w )', r'`\1`', exp)
).reset_index(name='FinalVal')
result
:
Patient FinalVal
0 A 1
1 B 0
2 C 0
The backticks are necessary to indicate these values represent column names, and not numeric values:
re.sub(r'(\w )', r'`\1`', exp)
# (((`71` `72`)*((`73` `75`) `SD75`))*((`74` `76`) `SD76`))
Notice the difference between 71 with backticks vs without:
# Column '71' the number 71
pd.DataFrame({'71': [1, 2, 3]}).eval('B = `71` 71')
71 B
0 1 72
1 2 73
2 3 74
Alternatively, the indicator_df
can be created with a pivot_table
after adding an indicator column with assign
:
exp = '(((71 72)*((73 75) SD75))*((74 76) SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
indicator_df = (
df.assign(
indicator=1 # Assign Indicator Value of 1
).pivot_table(
index='Patient',
columns='ID',
values='indicator',
aggfunc='first', # Handle Multiple 1s accept any
fill_value=0 # Missing gets filled with 0
).reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
# Add Backticks around columns names
re.sub(r'(\w )', r'`\1`', exp)
).reset_index(name='FinalVal')
Setup and imports used:
import re
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
'ID': ['72', 'SD75', '74', '74', '71', '72']
})