changing a dataframe based on another dataframe in python-CodePudding

Assume that we have a dataframe df1[1000,7] and another dataframe df2[300,6].
I am setting an example to understand my problem.

df1:  
 col1    col2      col3           ...     col8     newcol
 alex    george    whatever       ...     maria    nan
 julia   bill      sea            ...     paul     nan
 ...     ...        ...           ...     ...      nan
 chris   paul      micheal        ...     alex     nan

my df2 is a subset of df for example the following:

df2:
 col1    col2      col3           ...     col8     
 alex    george    whatever       ...     maria    
 chris   paul      micheal        ...     alex

I want the new column in my df to get specific values depending if the row exists in my df2. For example let s say here I want to write 'ok' if the row exists and 'missing' if it doesn't. Thus my df should become

df1:  
 col1    col2      col3           ...     col8     newcol
 alex    george    whatever       ...     maria    ok
 julia   bill      sea            ...     paul     missing
 ...     ...        ...           ...     ...      ...
 chris   paul      micheal        ...     alex     ok

how can I do that with one command without using for loops in python? I have thought of using fill.na command to write "missing" after i have inserted the values for the rows that exist in my df, however I am a bit confused on the syntax on how to fill the values for the rows that exist in df2

EDIT: I noticed I inserted the newcol by accident in df2. It does not exist there

CodePudding user response：

I would simply merge (although technically there's no extra column you are merging) the 2 dataframes using the left as your "main" one. If you include the indicator though, it will create a column telling you if the rows are in both dataframes. So here, obviously if you see "both", it's "ok" and "left_only", it is "missing".

import pandas as pd
import numpy as np

df1 = pd.DataFrame([
    ['alex','george','maria'],
    ['julia','bill','paul'],
    ['chris','paul','alex']
    ], columns = ['col1','col2', 'col3'])


df2 = pd.DataFrame([
    ['alex','george','maria'],
    ['chris','paul','alex']
    ], columns = ['col1','col2', 'col3'])


merged = df1.merge(df2, how='left', on=list(df1.columns), indicator=True)
df1['newcol'] = np.where(merged['_merge'] == 'both', 'ok', 'missing')

Output:

print(merged)
    col1    col2   col3     _merge
0   alex  george  maria       both
1  julia    bill   paul  left_only
2  chris    paul   alex       both

And with the new column to df1:

print(df1)
    col1    col2   col3   newcol
0   alex  george  maria       ok
1  julia    bill   paul  missing
2  chris    paul   alex       ok

CodePudding user response：

You might want to try this -

n_columns = len(df1.columns)
df1["newcol"] = (df1 == df2).apply(sum, axis=1).apply(lambda x: "ok" if x == n_columns else "missing")
df1