I have a pandas dataframe (df) with many subjects.
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   subject  20640 non-null  object
 1   block    20640 non-null  int64
Say I want to subset the df with the first n unique subjects (keeping all rows for those n subjects). Is there an easy command to do that?
CodePudding user response:
Use boolean indexing, filtering on the first n unique values with Series.isin:
n = 10
df1 = df[df['subject'].isin(df['subject'].unique()[:n])]
Or:
df1 = df[df['subject'].isin(df['subject'].drop_duplicates().head(n))]
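To try both variants on something small, here is a minimal self-contained sketch. The sample values are made up for illustration; only the column names subject and block come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s3", "s2", "s4", "s1"],
    "block": [1, 2, 1, 1, 2, 1, 3],
})

n = 2

# unique() returns values in order of first appearance,
# so slicing with [:n] gives the first n subjects seen in the frame.
first_subjects = df["subject"].unique()[:n]

# Keep every row whose subject is one of those n subjects.
df1 = df[df["subject"].isin(first_subjects)]

# Same selection via drop_duplicates(), which keeps the first occurrence of each value.
df2 = df[df["subject"].isin(df["subject"].drop_duplicates().head(n))]

print(df1)
```

Both variants keep all rows for the selected subjects, including rows that appear later in the frame (here the last s1 row).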
If you need only the first n runs of consecutive values (so that later repeats of an earlier subject are dropped):
print (df)
subject
0 a
1 a
2 b
3 f
4 d
5 g
6 a <-should be removed
7 b <-should be removed
n = 3
s = df['subject'].ne(df['subject'].shift()).cumsum()
print (s)
0 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
Name: subject, dtype: int32
df1 = df[s.le(n)]
print (df1)
subject
0 a
1 a
2 b
3 f
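For comparison, here is a runnable sketch of the consecutive-run approach on the same sample data as above, next to the isin approach. The names run_id, by_runs and by_isin are just illustrative.

```python
import pandas as pd

# Same sample data as in the printed example above.
df = pd.DataFrame({"subject": ["a", "a", "b", "f", "d", "g", "a", "b"]})
n = 3

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those "change" flags numbers the runs 1, 2, 3, ...
run_id = df["subject"].ne(df["subject"].shift()).cumsum()

# Keep only the rows that belong to the first n runs.
by_runs = df[run_id.le(n)]

# The isin approach instead keeps every row of the first n unique subjects,
# including the repeated 'a' and 'b' rows at the end.
by_isin = df[df["subject"].isin(df["subject"].unique()[:n])]

print(by_runs)
print(by_isin)
```

Which one to pick depends on the data: if each subject's rows are contiguous, both give the same result; if a subject can reappear later and those rows should not be kept, the run-based mask is the one to use.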