Concatenate columns ignoring duplicates with pandas

Does anyone know how to concatenate multiple columns while excluding duplicate values?

I'm learning Python. This is my first project, and I've run into a problem.

I have a dataset like this one:

Each key (col1 to col5) represents a column of my DataFrame, and an empty string means the cell has no value:

df = {'col1': ['a', 'b', 'a', 'c'], 'col2': ['', '', 'a', ''], 'col3': ['', 'a', '', 'b'], 'col4': ['a', '', 'b', ''], 'col5': ['b', 'c', 'c', '']}

I need output like this:

new column
a-b
a-b-c
a-b-c
b-c

For each row, the values must be sorted, unique, and concatenated with hyphens.

I was able to do this in Excel using TRANSPOSE, SORT and UNIQUE, like this:

=TEXTJOIN("-";;TRANSPOSE(UNIQUE(SORT(TRANSPOSE(A1:E1)))))

But I couldn't figure out how to do it in pandas. Can anyone help me, please?

CodePudding user response：

Example

You can build a reproducible example from a dictionary (for instance, the output of to_dict), using None instead of empty strings for the missing values:

import pandas as pd

data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)

df

    col1    col2    col3    col4    col5
0   a       None    None    a       b
1   b       None    a       None    c
2   a       a       None    b       c
3   c       None    b       None    None

Code

df.apply(lambda x: '-'.join(sorted(x.dropna().unique())), axis=1)

output:

0      a-b
1    a-b-c
2    a-b-c
3      b-c
dtype: object
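
The example above uses None for missing values, while the data in the question uses empty strings. As a rough sketch of how the same idea could be applied to that data directly, the helper below skips both empty strings and missing values before sorting and joining. The names join_unique and new_column are made up for illustration, and the last key is assumed to be col5:

```python
import pandas as pd

# Data as posted in the question: empty strings mark cells with no value.
# Assumption: the last key is meant to be 'col5'.
data = {
    'col1': ['a', 'b', 'a', 'c'],
    'col2': ['', '', 'a', ''],
    'col3': ['', 'a', '', 'b'],
    'col4': ['a', '', 'b', ''],
    'col5': ['b', 'c', 'c', ''],
}
df = pd.DataFrame(data)

def join_unique(row):
    # Hypothetical helper: keep each non-empty, non-missing value once,
    # then sort the values and join them with hyphens.
    values = {v for v in row if pd.notna(v) and v != ''}
    return '-'.join(sorted(values))

# Apply the helper row by row over the source columns only.
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
df['new_column'] = df[cols].apply(join_unique, axis=1)

print(df['new_column'].tolist())
```

Selecting the source columns explicitly keeps new_column out of the calculation if the line is run more than once.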

CodePudding user response：

Assuming your DataFrame is named df, you could try something like this:

# List for output
newList = []

# Iterate over the DataFrame rows
for _, row in df.iterrows():
    # Remove NaNs, remove duplicates, then sort alphabetically.
    # Chaining returns a new Series instead of mutating the row in place.
    cleaned = row.dropna().drop_duplicates().sort_values()
    # Add adjusted row to the list
    newList.append(cleaned.to_list())

# Output
print(newList)

If your DataFrame isn't named df, replace df.iterrows() with [your dataframe].iterrows().

This gives the output:

[['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]

If you really need the output formatted as in your question (a-b, a-b-c, etc.), you can iterate over newList and join each inner list with hyphens.
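
A minimal sketch of that last step, assuming newList holds the lists printed above (the name joined is only illustrative):

```python
# Rows as printed by the loop above.
newList = [['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]

# Join each inner list into one hyphen-separated string.
joined = ['-'.join(values) for values in newList]

print(joined)
```

Because joined has one entry per row, it could then be assigned back as a column, for example df['new_column'] = joined.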