Home > database >  How to find the same values from two dataframes considering index?
How to find the same values from two dataframes considering index?

Time:02-18

I have two dataframes: df_2008, df_2012 which consist data about different diseases. index id is an individual number of patient. And column dis1 is the special disease with binary information (yes/no).

df_2008 = pd.DataFrame(np.array([[11,'Yes', 'No', 'Yes'], [12,'Yes', 'Yes', 'No'], [13,'Yes', 'No', 'Yes']]),
                   columns=['id','dis1', 'dis2', 'dis3'])
df_2008.set_index('id',inplace=True)
df_2012 = pd.DataFrame(np.array([[11,'Yes', 'Yes','No'], [12,'Yes', 'No',"No"], [13,'Yes', 'No','No'],[14,'No', 'No','No']]),
                   columns=['id','dis1', 'dis2', 'dis3'])
df_2012.set_index('id',inplace=True)

I need to find all sick patient with dis1 from two dataframes considering id (number of patient). It is index. If I will have patient which is sick in 2008 (yes) and 2012 (yes) I want to put 2008, if I have sick patient only in 2012 (put 2012), if have not sick patient, I want to put 0 for example.

I tried:

def conditions(x,y):
    if (x['dis1'] == 'Yes')&(y['dis1'] == 'Yes'):
        return 2008
    elif (x['dis1'] == 'Yes')&(y['dis1'] != 'Yes'): 
        return 2012
    else:
        return 0
    
conditions(df_2008, df_2012)

I got an error:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response:

You could try doing an outer join of both dataframes, only using the relevant columns of each ("dis1"). Then, fill a "result" column by finding where each column has a NaN:

>> output = pd.merge(df_2008["dis1"], df_2012["dis1"], how="outer", left_index=True, right_index=True)  # You could add: suffixes=("_2008", "_2012")
>> output["result"] = 0
>> output.loc[output["dis1_x"].isna(), "result"] = 2012  # NaN in 2008, means only 2012
>> output.loc[output["dis1_y"].isna(), "result"] = 2008  # NaN in 2012, means only 2008
   dis1_x dis1_y  result
id                      
11    Yes    Yes       0
12    Yes    Yes       0
13    Yes    Yes       0
14    NaN     No    2012

CodePudding user response:

A reason of the error is that it is impossible to compare Seires and str, such as x['dist1' == 'Test'.

Instead, you should explicitly make a condition, such as x['dist'][0].value == 'Yes' or 'Yes' in x['dist']. Otherwise, it would be confusing what you are exactly want.

There can be two cases:

  1. CASE #1: whether you want to check if 'Yes' among all values in the column
  2. CASE #2: whether you want to check if 'Yes' for each rows

CASE #1

import pandas as pd
import numpy as np

df_2008 = pd.DataFrame(np.array([[11, 'Yes', 'No', 'Yes'], [12, 'Yes', 'Yes', 'No'], [13, 'Yes', 'No', 'Yes']]),
                       columns=['id', 'dis1', 'dis2', 'dis3'])
df_2008.set_index('id', inplace=True)
df_2012 = pd.DataFrame(
    np.array([[11, 'Yes', 'Yes', 'No'], [12, 'Yes', 'No', "No"], [13, 'Yes', 'No', 'No'], [14, 'No', 'No', 'No']]),
    columns=['id', 'dis1', 'dis2', 'dis3'])
df_2012.set_index('id', inplace=True)


def conditions(x, y):
    if ('Yes' in x['dis1'].values) and ('Yes' in y['dis1']):
        return 2008
    elif ('Yes' in x['dis1'].values) and ('Yes' not in y['dis1']):
        return 2012
    else:
        return 0


result = conditions(df_2008, df_2012)
print(result)
#2012

CASE #2

import pandas as pd
import numpy as np

df_2008 = pd.DataFrame(np.array([[11, 'Yes', 'No', 'Yes'], [12, 'Yes', 'Yes', 'No'], [13, 'Yes', 'No', 'Yes']]),
                       columns=['id', 'dis1', 'dis2', 'dis3'])
df_2008.set_index('id', inplace=True)
df_2012 = pd.DataFrame(
    np.array([[11, 'Yes', 'Yes', 'No'], [12, 'Yes', 'No', "No"], [13, 'Yes', 'No', 'No'], [14, 'No', 'No', 'No']]),
    columns=['id', 'dis1', 'dis2', 'dis3'])
df_2012.set_index('id', inplace=True)

df = pd.DataFrame()
df = pd.merge(df_2008["dis1"].reset_index(), df_2012["dis1"].reset_index(), how="outer", on=['id'])

print(df)
#   id dis1_x dis1_y
#0  11    Yes    Yes
#1  12    Yes    Yes
#2  13    Yes    Yes
#3  14    NaN     No

df['result'] = np.where(
    (df['dis1_x'] =='Yes') & (df['dis1_y']=='Yes'), 2008,
    np.where((df['dis1_x'] =='Yes') & (df['dis1_y']!='Yes'), 2012, 0)
)
print(df)
#   id dis1_x dis1_y  result
#0  11    Yes    Yes    2008
#1  12    Yes    Yes    2008
#2  13    Yes    Yes    2008
#3  14    NaN     No       0
  • Related