I have a data frame with users such as usera and userb. Each user has its own set of unique ids, and I need to group by user and take every second id for each user, counting by id rather than by row. I managed to get every second id overall, but that is not good enough because there can be multiple users. Here is my code, with the input and the expected output:
import pandas as pd
import numpy as np
id = ['11', '11', '11', '15', '15', '15', '23', '23', '25','25','26','26','27','27','27','28','28']
username = ['usera','usera','usera','usera','usera','usera','usera','usera','usera','usera','userb','userb','userb','userb','userb','userb','userb']
date = ['2021-05-04','2021-05-05','2021-05-05','2021-05-06','2021-06-07','2021-06-08','2021-07-09','2021-03-09','2021-04-10','2021-04-10','2021-04-10','2021-04-10','2021-04-10','2021-04-10','2021-04-10','2021-04-10','2021-04-10']
df = pd.DataFrame({'id': id, 'username': username, 'date': date})
df = df.sort_values(by=['id'], ignore_index=True)  # sort by id because the data frame is not sorted
# kick out non-unique IDs
unique_ids = np.unique(df['id'])
unique_ids = df.groupby('username')['id'].agg(['unique'])
print("g")
print(unique_ids)
print("gend")
print("g2")
otherframe = pd.DataFrame(unique_ids)
print(otherframe['unique'])
# every 2nd
print(unique_ids[::2])
print("\n\n head")
every_2nd = df[df['id'].isin(unique_ids[::2])]
# every_2nd should be a new data frame with every second id, grouped by user
# username  unique ids
# usera     [11, 15, 23, 25]
# userb     [26, 27, 28]
# usera every second id = [11, 23]
# userb every second id = [26, 28]
# expected output
# every_second_id_by_user = ['11', '11', '11', '23', '23', '26', '26', '28', '28']
#and every second date=
CodePudding user response:
Edit: @Akshay Sehgal's solution is better.
If I understand the question correctly, I believe what you want can be achieved as:
df.groupby(['username', 'id'])['id'].unique()[::2]
# username  id
# usera     11    [11]
#           23    [23]
# userb     26    [26]
#           28    [28]
# Name: id, dtype: object
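The key is to group by both username and id before taking the unique values. Note that [::2] then steps over all (username, id) groups in order rather than restarting for each user, so it matches the per-user result here only because usera has an even number of ids.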
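To illustrate the difference between slicing the two-key groupby and slicing each user's ids, here is a minimal sketch. The frame `small` and its ids are made up for the example (usera is given an odd number of ids on purpose); they are not from the question.

```python
import pandas as pd

# Hypothetical data: usera has three ids, userb has two.
small = pd.DataFrame({
    "username": ["usera", "usera", "usera", "userb", "userb"],
    "id": ["1", "2", "3", "4", "5"],
})

# Slicing the two-key groupby result steps over all groups in sorted order,
# so the count does not restart when the username changes.
global_pick = small.groupby(["username", "id"])["id"].unique()[::2]
global_pairs = list(global_pick.index)

# Slicing each user's array of unique ids restarts the count per user.
per_user = small.groupby("username")["id"].unique().str[::2]

print(global_pairs)
print(per_user)
```

With this data the first approach picks id 5 for userb, while the per-user slice picks id 4, which is probably what the question is after.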
CodePudding user response:
Try this -
df.groupby('username')['id'].unique().str[::2]
username
usera    [11, 23]
userb    [26, 28]
Name: id, dtype: object
To filter the original data frame down to the rows with these ids, use this -
idx = df.groupby('username')['id'].unique().str[::2].explode()
df[df['id'].isin(idx)]
    id username        date
0   11    usera  2021-05-04
1   11    usera  2021-05-05
2   11    usera  2021-05-05
6   23    usera  2021-07-09
7   23    usera  2021-03-09
10  26    userb  2021-04-10
11  26    userb  2021-04-10
15  28    userb  2021-04-10
16  28    userb  2021-04-10
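One thing to be aware of: isin only compares the id column, so it relies on ids never being shared between users, as in the question. If that might not hold, a possible variant is to match on (username, id) pairs with a merge instead. The sketch below uses made-up data in which id "2" belongs to both users; the names `keep`, `naive` and `by_pair` are only for illustration.

```python
import pandas as pd

# Hypothetical data: id "2" appears under both usera and userb.
df = pd.DataFrame({
    "id": ["1", "1", "2", "3", "2", "5"],
    "username": ["usera", "usera", "usera", "usera", "userb", "userb"],
})

# Every second unique id per user, as one row per (username, id) pair.
keep = df.groupby("username")["id"].unique().str[::2].explode().reset_index()

# Filtering on id alone also keeps usera's id "2", which was not selected.
naive = df[df["id"].isin(keep["id"])]

# Matching on both columns keeps only the selected pairs.
by_pair = df.merge(keep, on=["username", "id"], how="inner")

print(naive)
print(by_pair)
```

Note that merge returns a fresh 0-based index, so the original row labels are not preserved in `by_pair`.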
CodePudding user response:
With pd.factorize and np.mod, after getting an indexer for id and username:
df[np.mod(pd.factorize(df[['id','username']].to_records(index=False))[0],2)==0]
    id username        date
0   11    usera  2021-05-04
1   11    usera  2021-05-05
2   11    usera  2021-05-05
6   23    usera  2021-07-09
7   23    usera  2021-03-09
10  26    userb  2021-04-10
11  26    userb  2021-04-10
15  28    userb  2021-04-10
16  28    userb  2021-04-10
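The codes in the one-liner above are numbered across all (id, username) pairs, so the even/odd test runs over the whole frame rather than per user. A possible per-user variant, sketched here on the question's id and username columns, factorizes the ids inside each username group. The names `ids`, `rank` and `every_second` are my own, and the use of groupby with transform is an assumption about what fits, not part of the answer above.

```python
import pandas as pd

ids = ['11', '11', '11', '15', '15', '15', '23', '23', '25', '25',
       '26', '26', '27', '27', '27', '28', '28']
username = ['usera'] * 10 + ['userb'] * 7
df = pd.DataFrame({'id': ids, 'username': username})

# Number each user's ids 0, 1, 2, ... in order of first appearance,
# restarting the count for every username.
rank = df.groupby('username')['id'].transform(lambda s: pd.factorize(s)[0])

# Keep the rows whose id has an even position within its user.
every_second = df[rank % 2 == 0]

print(every_second)
```

On this data the result has the same rows as the output shown above, and the count would still restart at userb if usera had an odd number of ids.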