How can I select one of each category with a Pandas DataFrame in Python?

I have a dataframe that looks like this:

author|string
abc|hi
abc|yo
def|whats
ghi|up
ghi|dog

How can I select only one row per author? I'm at a loss. I want to do something like this:

df.loc[unique authors].sample(n=1000)

and get something like this:

author|string
abc|hi
def|whats
ghi|up

I was thinking of converting the author column to categories, but I don't know where to go from there.

I could just do something like this but it seems stupid.

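author_list = df['author'].unique().tolist()
indexes = []
for author in author_list:
  # index label of the first row belonging to this author
  indexes.append(df.loc[df['author'] == author].index[0])
# these are index labels, so select with .loc rather than .iloc
df.loc[indexes].sample(n=1000)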

CodePudding user response：

Use groupby with sample:

# sample 1 'string' of each 'author' group 
res = df.groupby('author').sample(1)

Output

>>> for _ in range(3):
...     print(df.groupby('author').sample(1), '\n')


  author string
0    abc     hi
2    def  whats
3    ghi     up 

  author string
0    abc     hi
2    def  whats
3    ghi     up 

  author string
1    abc     yo
2    def  whats
4    ghi    dog

Setup:

import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog']
})

Note that groupby(...).sample draws whole rows from each group, so if your DataFrame has multiple columns, the values in each sampled row stay together. GroupBy.sample requires pandas 1.1 or later; on older versions, you can sample one random row from each group with

res = df.groupby('author', group_keys=False).apply(pd.DataFrame.sample, n=1)
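
To see how groupby sampling behaves with more than one data column, here is a minimal sketch. The 'score' column and the random_state value are assumptions added for illustration and are not part of the question. The final line caps the requested sample size at the number of available rows, since sample(n=1000) raises an error when fewer than 1000 rows exist.

```python
import pandas as pd

# Same data as the setup, plus a hypothetical 'score' column
# (an assumption, added only to have a second data column).
df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog'],
    'score': [1, 2, 3, 4, 5],
})

# One random row per author; random_state makes the draw repeatable.
res = df.groupby('author').sample(n=1, random_state=0)

# Every sampled row is an intact row of the original frame.
assert res.equals(df.loc[res.index])

# Take at most 1000 of those rows without asking for more than exist.
picked = res.sample(n=min(1000, len(res)), random_state=0)
print(picked)
```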

CodePudding user response：

You can use drop_duplicates, which keeps the first row for each author:

out = df.drop_duplicates('author')
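
As a rough sketch of how this behaves on the sample data: by default drop_duplicates keeps the first occurrence, keep='last' keeps the last one, and shuffling the rows first gives a random pick per author. The variable names and the random_state value below are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog'],
})

# Deterministic: the first row encountered for each author.
first = df.drop_duplicates('author')

# Deterministic: the last row encountered for each author.
last = df.drop_duplicates('author', keep='last')

# Random: shuffle all rows, keep the first per author,
# then restore the original row order.
rand = (
    df.sample(frac=1, random_state=0)
      .drop_duplicates('author')
      .sort_index()
)

print(first)
print(last)
print(rand)
```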