I am trying to combine multiple columns from a dataframe into a new column in the same dataframe. Those columns could either have a string value or are Na or NaN. Whenever one column is Na/NaN, I would like these columns to not be included in the final string.
E.g.
a b c d RESULT
0 AA BB CC DD AA;BB;CC;DD
1 ab Na cd da ab;cd;da
2 Na xx Na Na xx
3 Na Na Na Na Na
I have tested multiple functions already.
df['RESULT'] = df['a'] ";" df['b'] ...
does not work as it will still nest the Na's.df['RESULT'] = ";".join(df['a'],df['b],...)
does not work as join just takes one argument (and I have 4)df['RESULTS'] = [f"{a};{b};{c}" for a,b,c in zip(df['a'],df['b'], df['b'])]
does not work as it adds the Na's as string to the output- Pandas str.cat():
df['fill_name']= df['RESULTS'].str.cat(df['a'],sep=";").str.cat(df['b'],sep=";")...
is the closest to what I am looking for, but as soon as there is one Na in one column, the whole output is Na.
In the end I am looking into something like the "TEXTJOIN" function in Excel.
CodePudding user response:
A combo with pandas.DataFrame.stack
and GroupBy.agg
:
cols = ["a", "b", "c", "d"]
df["RESULT"] = df[cols].stack().groupby(level=0).agg(";".join)
Output :
print(df)
a b c d RESULT
0 AA BB CC DD AA;BB;CC;DD
1 ab NaN cd da ab;cd;da
2 NaN xx NaN NaN xx
3 NaN NaN NaN NaN NaN
CodePudding user response:
Use DataFrame.stack
for remove missing values and aggregate join
:
columns = ['a','b','c','d']
df['RESULT'] = df[columns].stack().groupby(level=0).agg(';'.join)
print (df)
a b c d RESULT
0 AA BB CC DD AA;BB;CC;DD
1 ab NaN cd da ab;cd;da
2 NaN xx NaN NaN xx
3 NaN NaN NaN NaN NaN
Or remove missing values in custom function with replace empty strings:
df['RESULT'] = df[columns].agg(lambda x: ";".join(x.dropna()), axis=1).replace('',np.nan)
print (df)
a b c d RESULT
0 AA BB CC DD AA;BB;CC;DD
1 ab NaN cd da ab;cd;da
2 NaN xx NaN NaN xx
3 NaN NaN NaN NaN NaN