I have two dataframes: df_2008
, df_2012
which consist data about different diseases. index id
is an individual number of patient. And column dis1
is the special disease with binary information (yes/no).
df_2008 = pd.DataFrame(np.array([[11,'Yes', 'No', 'Yes'], [12,'Yes', 'Yes', 'No'], [13,'Yes', 'No', 'Yes']]),
columns=['id','dis1', 'dis2', 'dis3'])
df_2008.set_index('id',inplace=True)
df_2012 = pd.DataFrame(np.array([[11,'Yes', 'Yes','No'], [12,'Yes', 'No',"No"], [13,'Yes', 'No','No'],[14,'No', 'No','No']]),
columns=['id','dis1', 'dis2', 'dis3'])
df_2012.set_index('id',inplace=True)
I need to find all sick patient with dis1
from two dataframes considering id (number of patient). It is index. If I will have patient which is sick in 2008 (yes) and 2012 (yes) I want to put 2008, if I have sick patient only in 2012 (put 2012), if have not sick patient, I want to put 0 for example.
I tried:
def conditions(x,y):
if (x['dis1'] == 'Yes')&(y['dis1'] == 'Yes'):
return 2008
elif (x['dis1'] == 'Yes')&(y['dis1'] != 'Yes'):
return 2012
else:
return 0
conditions(df_2008, df_2012)
I got an error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
CodePudding user response:
You could try doing an outer join of both dataframes, only using the relevant columns of each ("dis1"
). Then, fill a "result"
column by finding where each column has a NaN
:
>> output = pd.merge(df_2008["dis1"], df_2012["dis1"], how="outer", left_index=True, right_index=True) # You could add: suffixes=("_2008", "_2012")
>> output["result"] = 0
>> output.loc[output["dis1_x"].isna(), "result"] = 2012 # NaN in 2008, means only 2012
>> output.loc[output["dis1_y"].isna(), "result"] = 2008 # NaN in 2012, means only 2008
dis1_x dis1_y result
id
11 Yes Yes 0
12 Yes Yes 0
13 Yes Yes 0
14 NaN No 2012
CodePudding user response:
A reason of the error is that it is impossible to compare Seires
and str
, such as x['dist1' == 'Test'
.
Instead, you should explicitly make a condition, such as x['dist'][0].value == 'Yes'
or 'Yes' in x['dist']
. Otherwise, it would be confusing what you are exactly want.
There can be two cases:
- CASE #1: whether you want to check if 'Yes' among all values in the column
- CASE #2: whether you want to check if 'Yes' for each rows
CASE #1
import pandas as pd
import numpy as np
df_2008 = pd.DataFrame(np.array([[11, 'Yes', 'No', 'Yes'], [12, 'Yes', 'Yes', 'No'], [13, 'Yes', 'No', 'Yes']]),
columns=['id', 'dis1', 'dis2', 'dis3'])
df_2008.set_index('id', inplace=True)
df_2012 = pd.DataFrame(
np.array([[11, 'Yes', 'Yes', 'No'], [12, 'Yes', 'No', "No"], [13, 'Yes', 'No', 'No'], [14, 'No', 'No', 'No']]),
columns=['id', 'dis1', 'dis2', 'dis3'])
df_2012.set_index('id', inplace=True)
def conditions(x, y):
if ('Yes' in x['dis1'].values) and ('Yes' in y['dis1']):
return 2008
elif ('Yes' in x['dis1'].values) and ('Yes' not in y['dis1']):
return 2012
else:
return 0
result = conditions(df_2008, df_2012)
print(result)
#2012
CASE #2
import pandas as pd
import numpy as np
df_2008 = pd.DataFrame(np.array([[11, 'Yes', 'No', 'Yes'], [12, 'Yes', 'Yes', 'No'], [13, 'Yes', 'No', 'Yes']]),
columns=['id', 'dis1', 'dis2', 'dis3'])
df_2008.set_index('id', inplace=True)
df_2012 = pd.DataFrame(
np.array([[11, 'Yes', 'Yes', 'No'], [12, 'Yes', 'No', "No"], [13, 'Yes', 'No', 'No'], [14, 'No', 'No', 'No']]),
columns=['id', 'dis1', 'dis2', 'dis3'])
df_2012.set_index('id', inplace=True)
df = pd.DataFrame()
df = pd.merge(df_2008["dis1"].reset_index(), df_2012["dis1"].reset_index(), how="outer", on=['id'])
print(df)
# id dis1_x dis1_y
#0 11 Yes Yes
#1 12 Yes Yes
#2 13 Yes Yes
#3 14 NaN No
df['result'] = np.where(
(df['dis1_x'] =='Yes') & (df['dis1_y']=='Yes'), 2008,
np.where((df['dis1_x'] =='Yes') & (df['dis1_y']!='Yes'), 2012, 0)
)
print(df)
# id dis1_x dis1_y result
#0 11 Yes Yes 2008
#1 12 Yes Yes 2008
#2 13 Yes Yes 2008
#3 14 NaN No 0