Consider the below dataframe:
import pandas as pd
from numpy import nan
data = [
(111, nan, nan, 111),
(112, 112, nan, 115),
(113, nan, nan, nan),
(nan, nan, nan, nan),
(118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])
num0 num1 num2 num3
0 111.0 NaN NaN 111.0
1 112.0 112.0 NaN 115.0
2 113.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 118.0 110.0 117.0 NaN
Assuming my index is unique, I'm looking to retrieve the unique values per index row, to an output like the one below. I wish to keep the empty rows.
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 110.0 117.0 118.0
I have a working, albeit slow, solution, see below. The output number order is not relevant, as long all values are presented to the leftmost column and nulls to the right. I'm looking for best practices and potential ideas to speed up the code. Thank you in advance.
def arrange_row(row):
values = list(set(row.dropna(axis=1).values[0]))
values = [nan] if not values else values
series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values) 1)])
return series
df.groupby(level=-1).apply(arrange_row).unstack(level=-1)
pd.version == '1.2.3'
CodePudding user response:
Use df.values
with List comprehension
and df.dropna
:
# Create a list of rows of dataframe
In [788]: l = df.values
# Use List Comprehension to remove dups from above list of lists
In [789]: l_without_dupes = [list(dict.fromkeys(i)) for i in l]
# Create a new dataframe from above list and drop the column with all NaN's
In [795]: res_df = pd.DataFrame(l_without_dupes).dropna(1, how='all')
In [796]: res_df
Out[796]:
0 1 2
0 111.0 NaN NaN
1 112.0 NaN 115.0
2 113.0 NaN NaN
3 NaN NaN NaN
4 118.0 110.0 117.0
CodePudding user response:
Another option, albeit longer:
outcome = (df.melt(ignore_index= False) # keep the index as a tracker
.reset_index()
# get the unique rows
.drop_duplicates(subset=['index','value'])
.dropna()
# use this to build the new column names
.assign(counter = lambda df: df.groupby('index').cumcount() 1)
.pivot('index', 'counter', 'value')
.add_prefix('num')
.reindex(df.index)
.rename_axis(columns=None)
)
outcome
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 118.0 110.0 117.0
If you want it to exactly match your output, you can dump it into numpy, sort and return to pandas:
pd.DataFrame(np.sort(outcome, axis = 1), columns = outcome.columns)
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 110.0 117.0 118.0