I have a very large dataset that I'm trying to shrink. My idea is to keep 100 rows per neighborhood.
Here's an overview of my data:
index | name | neighborhood |
---|---|---|
0 | name 1 | neighborhood A |
1 | name 2 | neighborhood A |
2 | name 3 | neighborhood B |
3 | name 4 | neighborhood B |
4 | name 5 | neighborhood C |
5 | name 6 | neighborhood C |
6 | name 7 | neighborhood D |
7 | name 8 | neighborhood D |
8 | name 9 | neighborhood E |
9 | name 10 | neighborhood E |
What is the most efficient way to do this?
Thanks in advance.
I'm expecting to create something that looks like this (one row per neighborhood here, since the sample only has two rows per group):
index | name | neighborhood |
---|---|---|
0 | name 1 | neighborhood A |
1 | name 3 | neighborhood B |
2 | name 5 | neighborhood C |
3 | name 7 | neighborhood D |
4 | name 9 | neighborhood E |
CodePudding user response:
It depends on how you want to select the rows.

First n rows per group, with `groupby.head`:

```python
n = 100
out = df.groupby('neighborhood').head(n)
```
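For reference, a minimal, self-contained sketch checking this against the sample data from the question (the frame and column names are taken from there; only `n` differs):

```python
import pandas as pd

# Toy frame mirroring the sample data (two rows per neighborhood).
df = pd.DataFrame({
    'name': [f'name {i}' for i in range(1, 11)],
    'neighborhood': [f'neighborhood {c}' for c in 'AABBCCDDEE'],
})

# n=1 reproduces the expected output above; use n=100 on the real data.
out = df.groupby('neighborhood').head(1).reset_index(drop=True)
print(out)
#      name    neighborhood
# 0  name 1  neighborhood A
# 1  name 3  neighborhood B
# 2  name 5  neighborhood C
# 3  name 7  neighborhood D
# 4  name 9  neighborhood E
```

Note that `head(n)` simply keeps every row of a group that has fewer than n rows, so small neighborhoods pass through untouched.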
Random n rows per group, with `groupby.sample`:

```python
n = 100
out = df.groupby('neighborhood').sample(n=n)
```
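One caveat: `sample(n=n)` raises a `ValueError` when a group has fewer than n rows (unless you pass `replace=True`). A possible workaround, sketched on the same toy frame as above, is to cap the sample size at the group size:

```python
import pandas as pd

df = pd.DataFrame({
    'name': [f'name {i}' for i in range(1, 11)],
    'neighborhood': [f'neighborhood {c}' for c in 'AABBCCDDEE'],
})
n = 100

# sample(n=100) would fail outright here, since each toy group has only
# 2 rows; capping n at the group size keeps small groups whole instead.
out = (
    df.groupby('neighborhood', group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), n)))
)
print(out)
```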
CodePudding user response:
I think you can use `groupby` and `nth`:

```python
dfx = df.groupby('neighborhood').nth[:100]
```
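Note: the slice syntax `nth[:100]` needs a reasonably recent pandas (1.4+, if I remember correctly, where `GroupBy.nth` became subscriptable). For a plain leading slice like this it should keep the same rows as `head(100)`.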