How to add values in a pandas data frame based on values of two columns of one of the data frame mer-CodePudding

I need to calculate the sensitivity and specificity of a test based on another test that is my control. To do this I need to merge three data frames.

The first concatenation is between a column that contains all cases with another column that contains the control test´s results. (I know how to do this but I show this previous step to let you understand what I need to do at the end).

First data frame:

data = [['ch1.1234578C>T'], ['ch2.123459G>A'], ['ch3.234569A>T'], ['chX.246890A>G']]
 
comparison = pd.DataFrame(data, columns = ['All_common_variants_ID'])

comparison

All_common_variants_ID
1 ch1.1234578C>t
2 ch2.123459G>A
3 ch3.234569A>T
4 chX.246890A>G

Second data frame:

data = [['ch1.1234578C>T'], ['ch2.123459G>A']]
 
control = pd.DataFrame(data, columns = ['Sample_ID'])

control

Sample_ID
1 ch1.1234578C>T
2 ch2.123459G>A

I have merged these two data frames with this code:

comparative = comparison.merge(control[['Sample_ID']],left_on='All_common_variants_ID',right_on='Sample_ID',how='outer').fillna('Real negative')
comparative = comparative.rename(columns={'Sample_ID': 'CONTROL'})
comparative

All_common_variants_ID  CONTROL

1 ch1.1234578C>T          ch1.1234578C>T  
2 ch2.123459G>A           ch2.123459G>A
3 ch3.234569A>T           Real negative
4 chX.246890A>G           Real negative

Now is where I have the problem.

I need to concatenate a third data frame (test) under conditions with the first and the second column of the comparative data frame.

The conditions are:

If a value of the test data frame matches with a value in the second column add a "True-positive".
If. value doesn't match with any value from the second column add " False-negative".
If a value that matches with a value of the first column and the second column is "Real negative" add a "False-positive".
For the rest of the cells, add "True-negative".

Following the sample provide, this would be the expected result.

All_common_variants_ID  CONTROL                 Test

1 ch1.1234578C>T          ch1.1234578C>T       True-positive       # ch1.1234578C>T match with the second column
2 ch2.123459G>A           ch2.123459G>A        False-negative      # ch2.123459G>A is not in my test column
3 ch3.234569A>T           Real negative        False-positive      # ch3.234569A>T match with first column but second column is real negative
4 chX.246890A>G           Real negative        True-negative       # chX.246890A>G is not in my test column and is not in the control column.

Some comments:

There are no repetitive values in any column
All_common_variants_ID contains all values between control and test columns.

CodePudding user response：

Use np.select

# Setup test dataframe
data = [['ch1.1234578C>T'], ['ch3.234569A>T']]
test = pd.DataFrame(data, columns=['Test'])

# Build variables to np.select
condlist = [comparative['CONTROL'].isin(test['Test']),
            ~comparative['CONTROL'].isin(test['Test'])
                & comparative['CONTROL'].ne('Real negative'),
            comparative['All_common_variants_ID'].isin(test['Test'])
                & comparative['CONTROL'].eq('Real negative')]

choicelist = ['True-positive', 'False-negative', 'False-positive']

default = 'True-negative'

# Create new column
comparative['Test'] = np.select(condlist, choicelist, default)

Output:

>>> comparative
  All_common_variants_ID         CONTROL            Test
0         ch1.1234578C>T  ch1.1234578C>T   True-positive
1          ch2.123459G>A   ch2.123459G>A  False-negative
2          ch3.234569A>T   Real negative  False-positive
3          chX.246890A>G   Real negative   True-negative