I'm a beginner in Pandas. I have a data file containing information on 10000 users. The data has 5 columns and 10000 rows. One of the columns is the users' district, which groups users by where they live (it takes just 7 different values, and some number of users live in each location). For example, out of these 10000 users, 300 live in the USA, 250 live in Canada, and so on. I want to build a DataFrame that includes five random rows of users from each of the districts USA, Canada, LA, NY and Japan. Also, the dimensions need to be 20*5. Can you please help me with how to do that? I know that for choosing random rows I need to use
s = df.sample(n=5)
but how can I choose 5 random rows from the users in each of those locations and set the dimensions?
CodePudding user response:
You can also sample from groups generated with groupby:
df.groupby('district').sample(n=5)
To restrict the sampling to those districts you can filter the df beforehand:
df[df['district'].isin(['USA', 'Canada', 'LA', 'NY', 'Japan'])].groupby('district').sample(n=5)
This assumes 'district' is the name of the district column. Also, if I understood correctly, since you are sampling 5 rows from each of 5 districts, the final DataFrame should have (5*5)x5 = 25x5 dimensions (25 rows and 5 columns), not 20x5.
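To check the shape yourself, here is a minimal sketch on made-up data. The column names other than 'district', the two extra districts ('UK', 'France') and the random_state values are assumptions for illustration only, not taken from your file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real file: 10000 rows, 5 columns.
# All column names except 'district' are hypothetical.
rng = np.random.default_rng(0)
districts = ['USA', 'Canada', 'LA', 'NY', 'Japan', 'UK', 'France']  # 7 values, last two invented
n_rows = 10_000
df = pd.DataFrame({
    'user_id': np.arange(n_rows),
    'age': rng.integers(18, 80, size=n_rows),
    'score': rng.random(n_rows),
    'active': rng.integers(0, 2, size=n_rows).astype(bool),
    'district': rng.choice(districts, size=n_rows),
})

wanted = ['USA', 'Canada', 'LA', 'NY', 'Japan']

# Keep only the wanted districts, then draw 5 random rows per district.
sampled = (
    df[df['district'].isin(wanted)]
    .groupby('district')
    .sample(n=5, random_state=0)  # random_state only makes the draw repeatable
)

print(sampled.shape)  # (25, 5)
```

If you really need 20 rows, sampling n=4 per district over the same 5 districts gives 20x5 instead. Note that sample raises an error if a district has fewer rows than n, unless you pass replace=True.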
You need pandas version >= 1.1.0 to use groupby(...).sample().
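If upgrading is not an option, a rough equivalent is to sample each group yourself and concatenate the pieces. This is only a sketch of one possible workaround, and the small DataFrame below (including the 'user_id' column and the 'UK' and 'France' districts) is invented for the example.

```python
import pandas as pd

# Tiny hypothetical frame: 10 users in each of 7 districts.
df = pd.DataFrame({
    'user_id': range(70),
    'district': ['USA', 'Canada', 'LA', 'NY', 'Japan', 'UK', 'France'] * 10,
})

wanted = ['USA', 'Canada', 'LA', 'NY', 'Japan']
subset = df[df['district'].isin(wanted)]

# Iterate over the groups, sample 5 rows from each, and stack the results.
sampled = pd.concat(
    [group.sample(n=5, random_state=0) for _, group in subset.groupby('district')]
)

print(sampled.shape)  # (25, 2)
```

This relies only on DataFrame.sample, which you are already using, plus pd.concat.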