does anyone know how to concatenate multiple columns excluding duplicated values?
I'm a student of python, this is my first project and I have a problem
I have a dataset like this one:
each number represents a column from my dataframe
df = {'col1': ['a','b','a','c'], 'col2': ["","",'a',''], 'col3': ['','a','','b'], 'col4': ['a','','b',''], 'col2': ['b','c','c','']}
Need a output like this:
new colum
a-b
a-b-c
a-c-b
b-c
Need the data sorted, concatenated and with unique values
I was able to do this in excel using transpose, sort and unique, like this:
=TEXTJOIN("-";;TRANSPOSE(UNIQUE(SORT(TRANSPOSE(A1:E1)))))
But I couldn't figure out how to do it on pandas. Can anoyne help me plz?
CodePudding user response:
Example
you can make example by to_dict
and make null from None
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)
df
col1 col2 col3 col4 col5
0 a None None a b
1 b None a None c
2 a a None b c
3 c None b None None
Code
df.apply(lambda x: '-'.join(sorted(x.dropna().unique())) ,axis=1)
output:
0 a-b
1 a-b-c
2 a-b-c
3 b-c
dtype: object
CodePudding user response:
Assuming your DataFrame's identifier is just 'df', you could try something like this:
# List for output
newList = []
# Iterate over the DataFrame rows
for i, row in df.iterrows():
# Remove duplicates
row.drop_duplicates(inplace=True)
# Remove NaNs
row.dropna(inplace=True)
# Sort alphabetically
row.sort_values(inplace=True)
# Add adjusted row to the list
newList.append(row.to_list())
# Output
print(newList)
If your DataFrame isn't named 'df', just substitute 'df.iterrows()' for '[your dataframe].iterrows()'.
This gives the output:
[['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]
If you really need the output to be formatted like you said (a-b, a-b-c, etc): you can iterate over 'newList' to concatenate them and add the hyphens.