I'd like to create a mock dataframe that has a random number of rows per user in Pandas. I attempted it in a for loop (below), but it doesn't work. Here's how I attempted to set it up and what I'd like my ideal dataframe to be.
Let's say I have a much smaller dataframe that looks like this. The key columns are userid
and the seq
.
If a user had 3 instances of baseline
value, it should have 3 rows in the dataframe. The user_id should be the same in each row
df = pd.DataFrame({
'user_id':
{0: 'user122',
1: 'user122',
2: 'user122',
3: 'user124',
4: 'user125',
5: 'user125',
6: 'user126',
7: 'user126',
8: 'user126'},
'baseline':
{0: 4.0,
1: 4.0,
2: 4.0,
3: 2,
4: 4,
5: 4,
6: 5,
7: 5,
8: 5},
'seq':
{0: 1,
1: 2,
2: 3,
3: 1,
4: 1,
5: 2,
6: 1,
7: 2,
8: 3},
})
In the mock dataset, some users have 3 instances, some have less.
I tried to create just the user and seq columns in a for loop below in a very small dataframe, but the user_id doesn't always match correctly with the number of rows it should have. I imagine there's a much better way of creating this dataframe. I'm open to any approaches
import random
lis = []
randomrange = [2,3]
user_experience_column = []
usernumber = []
i=0
for i in range (2):
print ('this is i, ', i)
usernumber_id = 'user' str(i)
print ('userid', str(i) )
i = i 1
for user_rows in range (random.randint(1, 5)):
print ('userid selected in random loop', usernumber_id )
print ('user rows random', user_rows)
user_experience_column.append(user_rows)
print ('user experience column', user_experience_column)
usernumber.extend([usernumber_id for i in range(user_rows)])
print ('user number', usernumber_id)
df = pd.DataFrame(list(zip(user_experience_column, usernumber)),
columns =['experiencenumber', 'userid'])
df
Now, in my ideal dataframe, I'll have 1000 users, each will have between 3-8 rows, and 60% will have 6 rows or more. But, for now, I'd be happy with code that could correctly create 100 users with two columns: the user_id and the correct sequence number (so if a user has 2 instances, the user_id would appear in 2 rows and they would be labled "1" and "2" respectively).
Thanks and please let me know if I can make this question any more clear.
CodePudding user response:
import pandas as pd
import numpy as np
# We only need one from each:
df = df.groupby('user_id', as_index=False)[['baseline']].max()
# For each row, we'll produce a number between baseline and 8, (9 isn't included),
# but for 60% of them, it'll be between 6 and 8.
repeat_num = np.where(np.random.random(len(df)) > 0.4,
np.random.randint(df.baseline.mask(df.baseline.lt(6), 6), 9),
np.random.randint(df.baseline, 9))
# Then we'll repeat each row this number of times:
df = pd.DataFrame(df.to_numpy().repeat(repeat_num, axis=0), columns=df.columns)
# Now we can add the sequence column:
df['seq'] = df.groupby('user_id').cumcount() 1
print(df)
Output (Your results will obviously vary):
user_id baseline seq
0 user122 4.0 1
1 user122 4.0 2
2 user122 4.0 3
3 user122 4.0 4
4 user122 4.0 5
5 user122 4.0 6
6 user122 4.0 7
7 user122 4.0 8
8 user124 2.0 1
9 user124 2.0 2
10 user124 2.0 3
11 user124 2.0 4
12 user124 2.0 5
13 user124 2.0 6
14 user125 4.0 1
15 user125 4.0 2
16 user125 4.0 3
17 user125 4.0 4
18 user126 5.0 1
19 user126 5.0 2
20 user126 5.0 3
21 user126 5.0 4
22 user126 5.0 5
23 user126 5.0 6
CodePudding user response:
Is this what you're looking for?
import numpy as np
import pandas as pd
# Define the number of User IDs to generate
num_users = 4
# First condition occurs with a probability of 40%
min_rows_per_user_cound1 = 3
max_rows_per_user_cound1 = 6
# Second condition occurs with a probability of 60%
min_rows_per_user_cound2 = 6
max_rows_per_user_cound2 = 8
# Create the user IDs
user_ids = list(map(lambda value: f"user{value 1}", range(num_users)))
# Create the rows for each user, where each user can appear between 1 and 4 times.
user_rows = [
user_id
for user in user_ids
for user_id in [user]
* (
np.random.randint(min_rows_per_user_cound2, max_rows_per_user_cound2)
if np.random.random() >= 0.4
else np.random.randint(
min_rows_per_user_cound1, max_rows_per_user_cound1
)
)
]
# Create the mock dataframe
mock_df = (
pd.DataFrame(
zip(user_rows, [0] * len(user_rows)),
columns=["userid", "experiencenumber"],
)
.assign(
experiencenumber=lambda xdf: xdf.groupby("userid")[
"experiencenumber"
].cumcount()
1
)
.assign(
baseline=lambda xdf: xdf.groupby("userid")[
"experiencenumber"
].transform("count")
)
)
print(mock_df)
# Prints:
#
# userid experiencenumber baseline
# 0 user1 1 3
# 1 user1 2 3
# 2 user1 3 3
# 3 user2 1 6
# 4 user2 2 6
# 5 user2 3 6
# 6 user2 4 6
# 7 user2 5 6
# 8 user2 6 6
# 9 user3 1 3
# 10 user3 2 3
# 11 user3 3 3
# 12 user4 1 7
# 13 user4 2 7
# 14 user4 3 7
# 15 user4 4 7
# 16 user4 5 7
# 17 user4 6 7
# 18 user4 7 7
Edit
Answering tom's questions:
Question 1: What [user] * number
means?
[user] * number
enables you to create lists. The user
inside the code is one of the user IDs ("user1"
, "user2"
, etc.). So [user] * number
is just a fancy way of repeating a given User ID $n$ times. There are other ways you can use to generate the same end result:
# -- Using For/Loop --------------------
l = []
for _ in range(some_number):
l.append(some_value)
# Or (not recommended):
l = []
for _ in range(some_number):
l = [*l, some_value]
# Example:
l = []
for _ in range(4):
l.append('user1')
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']
l = []
for _ in range(4):
l = [*l, 'user1']
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']
# -- Using List Comprehension ----------
[some_value for _ in range(some_number)]
# Or:
[some_value for i in range(some_number)]
# Example:
l = ['user1' for _ in range(4)]
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']
# -- Using `[user] * number` syntax ---------------
l = ['user1'] * 4
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']
# -- Using `map` (not recommended - Hard to read) ---------------
#
l = list(map(lambda _: 'user1', range(4)))
print(l)
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']
Question 2: Why didn't I indent the for/loop
?
Python, offers a shorter syntax when you want to create a new list based on the values of an existing list, called list comprehensions. To build the user_rows
list, we're actually iterating through the list user_ids
(which was created beforehand and contains the IDs for each user). Then we take each user ID and create another list repeating them a random number of times. Admittedly this way of writing code is confusing and hard to read. Here's the same code, but using for/loop
instead:
user_rows = []
# For each user ID we've created ('user1', 'user2', ...)
for user in user_ids:
# np.random.random() returns a number between 0 and 1 following
# a normal distribution. Threfore, we can expect that about
# 60% of the time, the number being generated will be greater
# than 0.4. This means that the first condition will be
# satisfied about 60% of the time, and the second will be
# satisfied about 40% of the time.
# This means that 60% of the users IDs will be repeated 6 to 8 times.
if np.random.random() >= 0.4: # Same as `if random.random() >= 0.4:...`
user_repeats = np.random.randint(6, 9)
# This means that 40% of the users IDs will be repeated 3 to 5 times.
else:
user_repeats = np.random.randint(3, 6)
# Now that we have randomly generated the number of times
# that the user ID will be repeated, we can add that user ID
# user_repeats times to our list.
for _ in range(user_repeats):
user_rows.append(user)
# [OPTIONAL] The for/loop above is equivalent to the following:
# user_rows.extend(user for _ in range(user_repeats))
print(user_rows)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1', 'user1', 'user1', 'user1', 'user2', 'user2', ...]
Question 3: .transform("count")
transform
lets you broadcast a groupby
operation across all values from the same group. This allows you to return something with the same index as what you started with. Here's an example:
Without .transform
:
mock_df.groupby('userid', as index=False)['experiencenumber'].count()
# Prints:
#
# userid experiencenumber
# 0 user1 3
# 1 user2 3
# 2 user3 6
With .transform
:
mock_df.groupby('userid')['experiencenumber'].transform('count')
# Prints:
# userid experiencenumber
# 0 user1 3
# 1 user1 3
# 2 user1 3
# 3 user2 3
# 4 user2 3
# 5 user2 3
# 6 user3 6
# 7 user3 6
# 8 user3 6
# 9 user3 6
# 10 user3 6
# 11 user3 6