I have two columns and I'm trying to make a new one depending if one has a null value or not. I have the following example and I´m trying to use the np.where() function but it doesn´t seems to work.

import pandas as pd 
import numpy as np 

# DF 1
a = pd.DataFrame([1,'nan',2],columns=['a1'])
# DF 2
b = pd.DataFrame(['hola','hola','hola'],columns=['b1']) 
# New Column
b['b2'] = np.where(a['a1'].astype(str) != 'nan', b['b1']   a['a1'].astype(str))

The result for the new column 'b2' should be:

hola1  
hola 
hola2

The np.where function also doesn´t has like an else option so I don´t know how to include that. I appreciate the help!

CodePudding user response：

You need to "synchronize" a and b dataframes to make pairwise comparisons (possibly with pd.concat):

b['b2'] = pd.concat([a.replace({'nan': ''}), b], axis=1).apply(lambda x:x['b1']   str(x['a1']), axis=1)
print(b)

     b1     b2
0  hola  hola1
1  hola   hola
2  hola  hola2

CodePudding user response：

This is an alternative solution using list comprehension:

b['b2'] = [str(j) if i=='nan' else str(j)   str(i) for i, j in zip(a.a1,b.b1) ]

print(b)

     b1     b2
0  hola  hola1
1  hola   hola
2  hola  hola2

I have used zip to loop both columns together.

CodePudding user response：

You can do a simple string concatenation:

>>> b['b1']   a['a1'].astype(str).replace('nan', '')

0    hola1
1     hola
2    hola2
dtype: object

CodePudding user response：

@RomanPerekhrest's solution is fine and you should upvote it. There is a vectorial solution if you like to try.

df = pd.concat([a, b], axis=1)

df["b2"] = df["b1"]   df["a1"].astype("str")\
    .str.replace("nan", "")

which returns

    a1    b1     b2
0    1  hola  hola1
1  nan  hola   hola
2    2  hola  hola2

In case you really want to use np.where you can try

df["b3"] = np.where(
    df['a1'].eq('nan'),
    df['b1'],
    df['b1']   df['a1'].astype(str))

which leads to

    a1    b1     b2     b3
0    1  hola  hola1  hola1
1  nan  hola   hola   hola
2    2  hola  hola2  hola2

EDIT - Timing

Here a small experiment when we assume a, and b have 30.000 rows.

Data

import pandas as pd
n = 10_000
a = pd.DataFrame([1,'nan',2],columns=['a1'])
# DF 2
b = pd.DataFrame(['hola','hola','hola'],columns=['b1'])
a = pd.concat([a for i in range(n)],
             ignore_index=True)
b = pd.concat([b for i in range(n)],
             ignore_index=True)

a_bk = a.copy()
b_bk = b.copy()

@RomanPerekhrest's solution

%%timeit -n 10 -r 3
a = a_bk.copy()
b = b_bk.copy()
b['b2'] = pd.concat(
    [a.replace({'nan': ''}), 
     b],
    axis=1)\
    .apply(lambda x:x['b1']   str(x['a1']), axis=1)

305 ms ± 17.8 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

Vectorial solution

%%timeit -n 10 -r 3
a = a_bk.copy()
b = b_bk.copy()
df = pd.concat([a, b], axis=1)

df["b2"] = df["b1"]   df["a1"].astype("str")\
    .str.replace("nan", "")

33.6 ms ± 4.34 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

Conclusion

Vectorial solution is 9x faster than using an apply.