Pandas dataframe: change unique values in each column to NaNs


I have a DataFrame arranged in a manner similar to this:

ID   Sample_1   Sample_2
A    0.182      0.754
B    0.182      0.754
C    0.182      0.01
D    0.182      0.2
E    0.9        0.2

As you can see, some values are repeated ("true" measurements) while others appear only once (these are actually "bad" measurements). What I need to do is replace all values that are unique within a column (the "bad" ones) with NaNs. This needs to be done for every column.

In other words, the final dataframe should look like this:

ID   Sample_1   Sample_2
A    0.182      0.754
B    0.182      0.754
C    0.182      NaN
D    0.182      0.2
E    NaN        0.2

A possible solution I've thought about involves groupby and filter to get the index values (as in Get indexes of unique values in column (pandas)) and then replacing those values, but the issue is that it works for only one column at a time:

import numpy as np

unique_loc = df.groupby("Sample_1").filter(lambda x: len(x) == 1).index
df.loc[unique_loc, "Sample_1"] = np.nan

This means it would need to be repeated for many columns (and I have many in the actual data). Is there a more efficient solution?

CodePudding user response:

Use DataFrame.where to set NaNs with a mask created by Series.duplicated applied per column via DataFrame.apply; DataFrame.iloc selects all columns except the first:

df.iloc[:, 1:] = df.iloc[:, 1:].where(df.iloc[:, 1:].apply(lambda x: x.duplicated(keep=False)))
print(df)
  ID  Sample_1  Sample_2
0  A     0.182     0.754
1  B     0.182     0.754
2  C     0.182       NaN
3  D     0.182     0.200
4  E       NaN     0.200

If ID is index:

df = df.set_index('ID')

df = df.where(df.apply(lambda x: x.duplicated(keep=False)))
print(df)
    Sample_1  Sample_2
ID                    
A      0.182     0.754
B      0.182     0.754
C      0.182       NaN
D      0.182     0.200
E        NaN     0.200
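For reference, a self-contained sketch of the ID-as-index variant, constructing the sample DataFrame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("ABCDE"),
    "Sample_1": [0.182, 0.182, 0.182, 0.182, 0.9],
    "Sample_2": [0.754, 0.754, 0.01, 0.2, 0.2],
}).set_index("ID")

# duplicated(keep=False) marks every occurrence of a repeated value True,
# so values appearing only once in their column get False and are masked to NaN
df = df.where(df.apply(lambda x: x.duplicated(keep=False)))
print(df)
```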

CodePudding user response:

One option is to find the singleton values via np.unique with return_counts=True, and replace them with pd.DataFrame.mask:

arr, counts = np.unique(df.filter(like='Sample').to_numpy(), return_counts=True)
df.mask(df.isin(arr[counts==1]))

  ID  Sample_1  Sample_2
0  A     0.182     0.754
1  B     0.182     0.754
2  C     0.182       NaN
3  D     0.182     0.200
4  E       NaN     0.200
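Note a subtle difference between the two answers: np.unique here pools all Sample columns together, so a value survives if it appears more than once anywhere in those columns, whereas the previous answer checks for duplicates within each column separately. On the question's data the results coincide, but a hypothetical sketch shows where they diverge (0.5 appears once per column but twice overall):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0.5 is unique per column but duplicated across columns
df = pd.DataFrame({"ID": ["A", "B"],
                   "Sample_1": [0.5, 0.3],
                   "Sample_2": [0.5, 0.7]})

# Global uniqueness (this answer): 0.5 survives, 0.3 and 0.7 are masked
arr, counts = np.unique(df.filter(like="Sample").to_numpy(), return_counts=True)
global_masked = df.mask(df.isin(arr[counts == 1]))

# Per-column uniqueness (previous answer): every value occurs once in its
# own column, so all Sample values become NaN
cols = ["Sample_1", "Sample_2"]
per_col = df.copy()
per_col[cols] = per_col[cols].where(
    per_col[cols].apply(lambda x: x.duplicated(keep=False)))
```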
