I have a dataframe that looks like this:
author|string
abc|hi
abc|yo
def|whats
ghi|up
ghi|dog
How can I select only one row per author? I'm at a loss. I want to do something like this:
df.loc[unique authors].sample(n=1000)
and get something like this:
author|string
abc|hi
def|whats
ghi|up
I was thinking of converting the author column to categories, but I don't know where to go from there.
I could just do something like this but it seems stupid.
author_list = df['author'].unique().tolist()
indexes = []
for author in author_list:
    # .index[0] is the row label of the author's first row
    indexes.append(df.loc[df['author'] == author].index[0])
df.loc[indexes].sample(n=1000)
CodePudding user response:
Use groupby + sample:
# sample 1 'string' of each 'author' group
res = df.groupby('author').sample(1)
Output
>>> for _ in range(3):
...     print(df.groupby('author').sample(1), '\n')
  author string
0    abc     hi
2    def  whats
3    ghi     up

  author string
0    abc     hi
2    def  whats
3    ghi     up

  author string
1    abc     yo
2    def  whats
4    ghi    dog
Setup:
import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog']
})
Note that groupby(...).sample draws whole rows from each group, so the values in a sampled row stay together even if your DataFrame has more columns than this one. GroupBy.sample was added in pandas 1.1.0; on an older version, you can sample one row from each group with
res = df.groupby('author', group_keys=False).apply(pd.DataFrame.sample, n=1)
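To connect this back to the .sample(n=1000) step in the question, here is a minimal sketch that picks one random row per author and then samples from that result. The random_state values and the min(...) cap are assumptions added for illustration, not part of the original answer; the cap is there because sample(n=...) raises an error when n exceeds the number of rows and replace is False.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog'],
})

# One random row per author. random_state is an assumption, used only
# to make the draw reproducible between runs.
one_per_author = df.groupby('author').sample(n=1, random_state=0)

# Cap n at the number of available rows so sample() does not raise.
n = min(1000, len(one_per_author))
result = one_per_author.sample(n=n, random_state=0)
print(result)
```

With the toy data above there are only three authors, so result has three rows; on a larger frame it would have at most 1000.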
CodePudding user response:
You can use drop_duplicates, which by default keeps the first row for each author:
out = df.drop_duplicates('author')
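If a random row per author is wanted rather than the first one, one possible variation is to shuffle the rows before dropping duplicates. This is a sketch under assumptions: the shuffle via sample(frac=1), the random_state value, and the final sort_index() are additions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog'],
})

# Deterministic: keeps the first row seen for each author.
first_per_author = df.drop_duplicates('author')

# Random: shuffle all rows, keep the first row seen per author,
# then restore the original row order.
random_per_author = (
    df.sample(frac=1, random_state=0)
      .drop_duplicates('author')
      .sort_index()
)
print(first_per_author)
print(random_per_author)
```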