Create random number of user rows and sequential integers in pandas-CodePudding

I'd like to create a mock dataframe that has a random number of rows per user in Pandas. I attempted it in a for loop (below), but it doesn't work. Here's how I attempted to set it up and what I'd like my ideal dataframe to be.

Let's say I have a much smaller dataframe that looks like this. The key columns are userid and the seq.

If a user had 3 instances of baseline value, it should have 3 rows in the dataframe. The user_id should be the same in each row

df = pd.DataFrame({
  'user_id': 
  {0: 'user122',
  1: 'user122',
  2: 'user122',
  3: 'user124',
  4: 'user125',
  5: 'user125',
  6: 'user126',
  7: 'user126',
  8: 'user126'},
  'baseline': 
  {0: 4.0,
  1: 4.0,
  2: 4.0,
  3: 2,
  4: 4,
  5: 4,
  6: 5,
  7: 5,
  8: 5},
  'seq': 
  {0: 1,
  1: 2,
  2: 3,
  3: 1,
  4: 1,
  5: 2,
  6: 1,
  7: 2,
  8: 3},
})

In the mock dataset, some users have 3 instances, some have less.

I tried to create just the user and seq columns in a for loop below in a very small dataframe, but the user_id doesn't always match correctly with the number of rows it should have. I imagine there's a much better way of creating this dataframe. I'm open to any approaches

import random


lis = []
randomrange = [2,3]
user_experience_column = []
usernumber = []


i=0
for i in range (2):
  print ('this is i, ', i)
  usernumber_id = 'user' str(i)
  print ('userid', str(i) )
  i = i   1
  for user_rows in range (random.randint(1, 5)):
    print ('userid selected in random loop', usernumber_id )
    print ('user rows random', user_rows)
    user_experience_column.append(user_rows)
    print ('user experience column', user_experience_column)
    usernumber.extend([usernumber_id for i in range(user_rows)])
    print ('user number', usernumber_id)

df = pd.DataFrame(list(zip(user_experience_column, usernumber)),
               columns =['experiencenumber', 'userid'])

df

Now, in my ideal dataframe, I'll have 1000 users, each will have between 3-8 rows, and 60% will have 6 rows or more. But, for now, I'd be happy with code that could correctly create 100 users with two columns: the user_id and the correct sequence number (so if a user has 2 instances, the user_id would appear in 2 rows and they would be labled "1" and "2" respectively).

Thanks and please let me know if I can make this question any more clear.

CodePudding user response：

import pandas as pd
import numpy as np

# We only need one from each:
df = df.groupby('user_id', as_index=False)[['baseline']].max()

# For each row, we'll produce a number between baseline and 8, (9 isn't included),
# but for 60% of them, it'll be between 6 and 8.
repeat_num = np.where(np.random.random(len(df)) > 0.4,
                      np.random.randint(df.baseline.mask(df.baseline.lt(6), 6), 9),
                      np.random.randint(df.baseline, 9))

# Then we'll repeat each row this number of times:
df = pd.DataFrame(df.to_numpy().repeat(repeat_num, axis=0), columns=df.columns)

# Now we can add the sequence column:
df['seq'] = df.groupby('user_id').cumcount()   1

print(df)

Output (Your results will obviously vary):

    user_id baseline  seq
0   user122      4.0    1
1   user122      4.0    2
2   user122      4.0    3
3   user122      4.0    4
4   user122      4.0    5
5   user122      4.0    6
6   user122      4.0    7
7   user122      4.0    8
8   user124      2.0    1
9   user124      2.0    2
10  user124      2.0    3
11  user124      2.0    4
12  user124      2.0    5
13  user124      2.0    6
14  user125      4.0    1
15  user125      4.0    2
16  user125      4.0    3
17  user125      4.0    4
18  user126      5.0    1
19  user126      5.0    2
20  user126      5.0    3
21  user126      5.0    4
22  user126      5.0    5
23  user126      5.0    6

CodePudding user response：

Is this what you're looking for?

import numpy as np
import pandas as pd

# Define the number of User IDs to generate
num_users = 4

# First condition occurs with a probability of 40%
min_rows_per_user_cound1 = 3
max_rows_per_user_cound1 = 6

# Second condition occurs with a probability of 60%
min_rows_per_user_cound2 = 6
max_rows_per_user_cound2 = 8

# Create the user IDs
user_ids = list(map(lambda value: f"user{value   1}", range(num_users)))
# Create the rows for each user, where each user can appear between 1 and 4 times.
user_rows = [
    user_id
    for user in user_ids
    for user_id in [user]
    * (
        np.random.randint(min_rows_per_user_cound2, max_rows_per_user_cound2)
        if np.random.random() >= 0.4
        else np.random.randint(
            min_rows_per_user_cound1, max_rows_per_user_cound1
        )
    )
]
# Create the mock dataframe
mock_df = (
    pd.DataFrame(
        zip(user_rows, [0] * len(user_rows)),
        columns=["userid", "experiencenumber"],
    )
    .assign(
        experiencenumber=lambda xdf: xdf.groupby("userid")[
            "experiencenumber"
        ].cumcount()
          1
    )
    .assign(
        baseline=lambda xdf: xdf.groupby("userid")[
            "experiencenumber"
        ].transform("count")
    )
)
print(mock_df)
# Prints:
#
#    userid  experiencenumber  baseline
# 0   user1                 1         3
# 1   user1                 2         3
# 2   user1                 3         3
# 3   user2                 1         6
# 4   user2                 2         6
# 5   user2                 3         6
# 6   user2                 4         6
# 7   user2                 5         6
# 8   user2                 6         6
# 9   user3                 1         3
# 10  user3                 2         3
# 11  user3                 3         3
# 12  user4                 1         7
# 13  user4                 2         7
# 14  user4                 3         7
# 15  user4                 4         7
# 16  user4                 5         7
# 17  user4                 6         7
# 18  user4                 7         7

Edit

Answering tom's questions:

Question 1: What `[user] * number` means?

[user] * number enables you to create lists. The user inside the code is one of the user IDs ("user1", "user2", etc.). So [user] * number is just a fancy way of repeating a given User ID $n$ times. There are other ways you can use to generate the same end result:


# -- Using For/Loop --------------------
l = []
for _ in range(some_number):
    l.append(some_value)

# Or (not recommended):
l = []
for _ in range(some_number):
    l = [*l, some_value]

# Example:
l = []
for _ in range(4):
    l.append('user1')

print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']

l = []
for _ in range(4):
    l = [*l, 'user1']

print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']

# -- Using List Comprehension ----------

[some_value for _ in range(some_number)]

# Or:
[some_value for i in range(some_number)]

# Example:
l = ['user1' for _ in range(4)]

print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']


# -- Using `[user] * number` syntax ---------------

l = ['user1'] * 4
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']

# -- Using `map` (not recommended - Hard to read) ---------------
#
l = list(map(lambda _: 'user1', range(4)))
print(l)
print(l)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1']

Question 2: Why didn't I indent the `for/loop`?

Python, offers a shorter syntax when you want to create a new list based on the values of an existing list, called list comprehensions. To build the user_rows list, we're actually iterating through the list user_ids (which was created beforehand and contains the IDs for each user). Then we take each user ID and create another list repeating them a random number of times. Admittedly this way of writing code is confusing and hard to read. Here's the same code, but using for/loop instead:

user_rows = []

# For each user ID we've created ('user1', 'user2', ...)
for user in user_ids:
    # np.random.random() returns a number between 0 and 1 following
    # a normal distribution. Threfore, we can expect that about
    # 60% of the time, the number being generated will be greater
    # than 0.4. This means that the first condition will be
    # satisfied about 60% of the time, and the second will be
    # satisfied about 40% of the time.
    # This means that 60% of the users IDs will be repeated 6 to 8 times.
    if np.random.random() >= 0.4: # Same as `if random.random() >= 0.4:...`
        user_repeats = np.random.randint(6, 9)


    # This means that 40% of the users IDs will be repeated 3 to 5 times.
    else:
        user_repeats = np.random.randint(3, 6)

    # Now that we have randomly generated the number of times
    # that the user ID will be repeated, we can add that user ID
    # user_repeats times to our list.
    for _ in range(user_repeats):
        user_rows.append(user)

    # [OPTIONAL] The for/loop above is equivalent to the following:
    # user_rows.extend(user for _ in range(user_repeats))

print(user_rows)
# Prints:
#
# ['user1', 'user1', 'user1', 'user1', 'user1', 'user1', 'user1', 'user2', 'user2', ...]

Question 3: `.transform("count")`

transform lets you broadcast a groupby operation across all values from the same group. This allows you to return something with the same index as what you started with. Here's an example:

Without .transform:


mock_df.groupby('userid', as index=False)['experiencenumber'].count()
# Prints:
#
#   userid  experiencenumber
# 0  user1                 3
# 1  user2                 3
# 2  user3                 6

With .transform:


mock_df.groupby('userid')['experiencenumber'].transform('count')

# Prints:
#   userid  experiencenumber
# 0  user1                 3
# 1  user1                 3
# 2  user1                 3
# 3  user2                 3
# 4  user2                 3
# 5  user2                 3
# 6  user3                 6
# 7  user3                 6
# 8  user3                 6
# 9  user3                 6
# 10 user3                 6
# 11 user3                 6

Edit

Question 1: What [user] * number means?

Question 2: Why didn't I indent the for/loop?

Question 3: .transform("count")

Question 1: What `[user] * number` means?

Question 2: Why didn't I indent the `for/loop`?

Question 3: `.transform("count")`