I have a DataFrame arranged similarly to this:
ID Sample_1 Sample_2
A 0.182 0.754
B 0.182 0.754
C 0.182 0.01
D 0.182 0.2
E 0.9 0.2
As you can see, there are some repeated values ("true" measurements) and single values (which are actually "bad" measurements). What I need to do is replace all unique values (the so-called "bad" ones) with NaN. This needs to be done for all columns.
In other words, the final dataframe should look like this:
ID Sample_1 Sample_2
A 0.182 0.754
B 0.182 0.754
C 0.182 NaN
D 0.182 0.2
E NaN 0.2
A possible solution I've thought about involves groupby and filter to get the index values (as in Get indexes of unique values in column (pandas)) and then replace the values, but the issue is that it works only for one column at a time:
unique_loc = df.groupby("Sample_1").filter(lambda x: len(x) == 1).index
df.loc[unique_loc, "Sample_1"] = np.nan
This means it would need to be repeated for many columns (and I have many in the actual data). Is there a more efficient solution?
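For illustration, the per-column workaround would look like this: a minimal sketch that recreates the sample data from the question and loops the groupby/filter step over each measurement column.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "Sample_1": [0.182, 0.182, 0.182, 0.182, 0.9],
    "Sample_2": [0.754, 0.754, 0.01, 0.2, 0.2],
})

# One groupby/filter pass per column: groups of size 1 are "bad" values.
for col in ["Sample_1", "Sample_2"]:
    unique_loc = df.groupby(col).filter(lambda x: len(x) == 1).index
    df.loc[unique_loc, col] = np.nan

print(df)
```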
CodePudding user response:
Use DataFrame.where to set NaNs with a mask created by Series.duplicated per column in DataFrame.apply; DataFrame.iloc selects every column except the first:
df.iloc[:, 1:] = df.iloc[:, 1:].where(df.iloc[:, 1:].apply(lambda x: x.duplicated(keep=False)))
print(df)
ID Sample_1 Sample_2
0 A 0.182 0.754
1 B 0.182 0.754
2 C 0.182 NaN
3 D 0.182 0.200
4 E NaN 0.200
If ID is the index:
df = df.set_index('ID')
df = df.where(df.apply(lambda x: x.duplicated(keep=False)))
print(df)
Sample_1 Sample_2
ID
A 0.182 0.754
B 0.182 0.754
C 0.182 NaN
D 0.182 0.200
E NaN 0.200
CodePudding user response:
One option is to get the values that occur only once via np.unique, and replace them with pd.DataFrame.mask:
arr, counts = np.unique(df.filter(like='Sample').to_numpy(), return_counts=True)
df.mask(df.isin(arr[counts == 1]))
ID Sample_1 Sample_2
0 A 0.182 0.754
1 B 0.182 0.754
2 C 0.182 NaN
3 D 0.182 0.200
4 E NaN 0.200
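As a self-contained sketch (again recreating the sample data): np.unique counts each distinct value across all Sample_* columns at once, and isin flags the values whose count is 1. Note this counts across columns jointly, so a value appearing once in two different columns would be treated as repeated and kept, unlike the per-column duplicated approach.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "Sample_1": [0.182, 0.182, 0.182, 0.182, 0.9],
    "Sample_2": [0.754, 0.754, 0.01, 0.2, 0.2],
})

# Count every distinct value across the Sample_* columns.
arr, counts = np.unique(df.filter(like="Sample").to_numpy(), return_counts=True)

# Mask (set to NaN) the values that occur exactly once overall.
out = df.mask(df.isin(arr[counts == 1]))
print(out)
```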