Assume that we have a dataframe df1[1000,7] and another dataframe df2[300,6].
I am setting an example to understand my problem.
df1:
col1 col2 col3 ... col8 newcol
alex george whatever ... maria nan
julia bill sea ... paul nan
... ... ... ... ... nan
chris paul micheal ... alex nan
my df2 is a subset of df for example the following:
df2:
col1 col2 col3 ... col8
alex george whatever ... maria
chris paul micheal ... alex
I want the new column in my df to get specific values depending if the row exists in my df2. For example let s say here I want to write 'ok' if the row exists and 'missing' if it doesn't. Thus my df should become
df1:
col1 col2 col3 ... col8 newcol
alex george whatever ... maria ok
julia bill sea ... paul missing
... ... ... ... ... ...
chris paul micheal ... alex ok
how can I do that with one command without using for loops in python? I have thought of using fill.na command to write "missing" after i have inserted the values for the rows that exist in my df, however I am a bit confused on the syntax on how to fill the values for the rows that exist in df2
EDIT: I noticed I inserted the newcol by accident in df2. It does not exist there
CodePudding user response:
I would simply merge (although technically there's no extra column you are merging) the 2 dataframes using the left as your "main" one. If you include the indicator though, it will create a column telling you if the rows are in both dataframes. So here, obviously if you see "both"
, it's "ok" and "left_only"
, it is "missing".
import pandas as pd
import numpy as np
df1 = pd.DataFrame([
['alex','george','maria'],
['julia','bill','paul'],
['chris','paul','alex']
], columns = ['col1','col2', 'col3'])
df2 = pd.DataFrame([
['alex','george','maria'],
['chris','paul','alex']
], columns = ['col1','col2', 'col3'])
merged = df1.merge(df2, how='left', on=list(df1.columns), indicator=True)
df1['newcol'] = np.where(merged['_merge'] == 'both', 'ok', 'missing')
Output:
print(merged)
col1 col2 col3 _merge
0 alex george maria both
1 julia bill paul left_only
2 chris paul alex both
And with the new column to df1
:
print(df1)
col1 col2 col3 newcol
0 alex george maria ok
1 julia bill paul missing
2 chris paul alex ok
CodePudding user response:
You might want to try this -
n_columns = len(df1.columns)
df1["newcol"] = (df1 == df2).apply(sum, axis=1).apply(lambda x: "ok" if x == n_columns else "missing")
df1