Pandas unique values per row, variable number of columns with data


Consider the dataframe below:

import pandas as pd
from numpy import nan

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]

df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])

    num0    num1    num2    num3
0   111.0   NaN     NaN     111.0
1   112.0   112.0   NaN     115.0
2   113.0   NaN     NaN     NaN
3   NaN     NaN     NaN     NaN
4   118.0   110.0   117.0   NaN

Assuming my index is unique, I'm looking to retrieve the unique values of each row, arranged into an output like the one below. I wish to keep the empty rows.

    num1    num2    num3
0   111.0   NaN     NaN
1   112.0   115.0   NaN
2   113.0   NaN     NaN
3   NaN     NaN     NaN
4   110.0   117.0   118.0

I have a working, albeit slow, solution; see below. The order of the numbers in the output is not relevant, as long as all values are pushed to the leftmost columns and the nulls to the right. I'm looking for best practices and ideas to speed up the code. Thank you in advance.

def arrange_row(row):
    # Drop the NaN columns, then deduplicate the remaining values
    values = list(set(row.dropna(axis=1).values[0]))
    # Keep all-NaN rows by emitting a single NaN
    values = [nan] if not values else values
    series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values) + 1)])
    return series

df.groupby(level=-1).apply(arrange_row).unstack(level=-1)
pd.__version__ == '1.2.3'

CodePudding user response:

Use df.values with a list comprehension and df.dropna:

# Grab the rows of the dataframe as a 2-D numpy array
In [788]: l = df.values

# Use a list comprehension with dict.fromkeys to drop duplicates from each row
# while preserving order (NaNs are not treated as duplicates, since nan != nan)
In [789]: l_without_dupes = [list(dict.fromkeys(i)) for i in l]

# Create a new dataframe from the above lists and drop any column that is all NaN
In [795]: res_df = pd.DataFrame(l_without_dupes).dropna(axis=1, how='all')

In [796]: res_df
Out[796]: 
       0      1      2
0  111.0    NaN    NaN
1  112.0    NaN  115.0
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  118.0  110.0  117.0
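
If you want the column labels to match the num1, num2, ... style of the desired output, a small follow-up rename should do it (a sketch of my own, not part of the original answer, assuming res_df from above):

# Relabel the default integer columns 0, 1, 2 as num1, num2, num3
res_df.columns = [f'num{i + 1}' for i in range(res_df.shape[1])]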

CodePudding user response:

Another option, albeit longer:

outcome = (df.melt(ignore_index=False)  # keep the index as a tracker
             .reset_index()
             # get the unique rows
             .drop_duplicates(subset=['index', 'value'])
             # drop the NaN rows; the reindex below restores the empty rows
             .dropna()
             # use a per-index cumulative count to build the new column names
             .assign(counter=lambda df: df.groupby('index').cumcount() + 1)
             .pivot(index='index', columns='counter', values='value')
             .add_prefix('num')
             .reindex(df.index)
             .rename_axis(columns=None)
)

outcome 

    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  118.0  110.0  117.0

If you want it to exactly match your output, you can dump it into numpy, sort each row, and return to pandas:

import numpy as np

pd.DataFrame(np.sort(outcome, axis=1), columns=outcome.columns)

    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  110.0  117.0  118.0
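
Note that rebuilding the frame from the sorted numpy array discards the original index (harmless here, since it is the default RangeIndex, but worth knowing in general). A minimal sketch, of my own, that passes the index back explicitly:

# Sort each row (NaNs go to the end) and keep the original index
pd.DataFrame(np.sort(outcome.to_numpy(), axis=1), index=outcome.index, columns=outcome.columns)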
