I have a small problem when trying to merge pandas Series together; thanks in advance for your help.
I have a huge file containing 30M rows. I can't fit all the data in memory, so I used the chunksize option of the pandas.read_csv method.
Here is an example of the file I have to parse. It represents accesses by users to certain modules inside the app. I replaced the time with x since it is not important in this case. We have data about each user (user_id), the time, and the module (module_id) they accessed.
user_id;time;module_id
1;x;101
1;x;101
1;x;102
1;x;102
2;x;102
2;x;102
2;x;103
My task is to go through the whole document and find out how many times each user has accessed each module. I also have another dataframe (df) with information about each user. The index of df is user_id.
Here is the code I have so far:
logs = pd.read_csv(..., chunksize=50000, index_col='user_id')
module_accesses = pd.Series(index=df.index, dtype='object', name='module_id')

for log in logs:
    module_accesses = process(log, module_accesses)

def process(log, module_accesses):
    # Edit: the line below was added
    log['module_id'] = log['description'].apply(getModuleIdFromDescription)
    groups = log.groupby(['user_id', 'module_id']).size()
    new_groups = module_accesses + groups
    return new_groups
When I initialize module_accesses, it looks like this:
user_id
440250 NaN
428394 NaN
...
746792 NaN
746794 NaN
The variable groups looks like this:
user_id module_id
592844 272 2
1711 2
6707 1
594073 326 1
3094 1
..
649479 288 9
2302 9
2312 1
651921 251 2
948 1
However, when I do new_groups = module_accesses + groups, the variable new_groups has this value:
592844 272 NaN
1711 NaN
6707 NaN
594073 326 NaN
3094 NaN
...
649479 288 NaN
2302 NaN
2312 NaN
651921 251 NaN
948 NaN
I want to be able to go through each chunk and update the Series module_accesses. The count of accesses for each user_id / module_id pair should increase, and at the end I should have the total count of accesses for each user and each module. The expected result would be like the variable groups, but for the whole file. We have a lot of users (350k) and modules (100k). Also, most users only access 10-20 modules. Example of expected result:
user_id module_id
592844 272 27389
1711 103827
6707 194625
594073 326 2846
3094 103826
..
649479 288 1023
2302 2038
2312 12344
651921 251 829
948 302
CodePudding user response:
I think that you'll need module_accesses to be a pd.DataFrame, because you need to keep track of the "number of accesses to a module per user":
module_accesses = pd.DataFrame([], columns=["user_id", "module_id", "n"])
That said, I would iteratively merge module_accesses (current known accesses) and groups (new accesses to register), and then combine the two count columns into a single one:
# Slight modification to name the size column "n"
groups = log.groupby(["user_id", "module_id"]).size().to_frame("n")
# Merge known and new accesses
new_groups = pd.merge(module_accesses, groups, how="outer", on=["user_id", "module_id"])
# Combine both counts into a single count
new_groups["n"] = new_groups.pop("n_x").fillna(0) new_groups.pop("n_y").fillna(0)
Resulting definition for process:
def process(log, module_accesses):
    groups = log.groupby(["user_id", "module_id"]).size().to_frame("n")
    new_groups = pd.merge(module_accesses, groups, how="outer", on=["user_id", "module_id"])
    new_groups["n"] = new_groups.pop("n_x").fillna(0) + new_groups.pop("n_y").fillna(0)
    return new_groups
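For completeness, here is a minimal sketch of how that process could be driven from the chunked reader in the question. This is only a sketch: it assumes logs is the reader from the question and that module_id still has to be derived from the description column, as in the question's edit.

module_accesses = pd.DataFrame([], columns=["user_id", "module_id", "n"])

for log in logs:
    # Derive module_id before counting, as in the question's edit
    log["module_id"] = log["description"].apply(getModuleIdFromDescription)
    module_accesses = process(log, module_accesses)

# Optional: index by (user_id, module_id) to match the expected output shape
result = module_accesses.set_index(["user_id", "module_id"])["n"]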
Another idea (a sketch follows after this list):
- Use process to return groups (without calculating new_groups)
- Accumulate the outputs in a list of dataframes
- Use pd.concat to create a single big dataframe
- Use groupby to reduce the big dataframe into your expected format
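A rough sketch of that idea, assuming logs is the chunked reader and getModuleIdFromDescription is the helper from the question (not a definitive implementation):

partial_counts = []

for log in logs:
    # Per-chunk counts, keyed by (user_id, module_id)
    log["module_id"] = log["description"].apply(getModuleIdFromDescription)
    partial_counts.append(log.groupby(["user_id", "module_id"]).size())

# One big Series of per-chunk counts, then reduce duplicate keys by summing
counts = pd.concat(partial_counts).groupby(level=["user_id", "module_id"]).sum()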
CodePudding user response:
Problems:
- The sum operation of two pd.Series is conducted index-by-index, and apparently your module_accesses and groups do not share the same set of indices, therefore it doesn't sum as you expected.
- Even if you make sure they have the same set of indices, summing a number with np.nan will give you np.nan. So you probably want to initialize your module_accesses to zeros instead of NaNs (a small demonstration follows below).
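A tiny illustration of both points, with made-up toy values:

import numpy as np
import pandas as pd

nan_init = pd.Series([np.nan, np.nan], index=[1, 2])  # like the NaN-initialized module_accesses
counts = pd.Series([3, 4], index=[2, 3])              # like one chunk's counts

print(nan_init + counts)  # all NaN: indices 1 and 3 exist on only one side, and NaN + 3 is still NaN

# Series.add with fill_value=0 treats the missing/NaN side as 0, so the counts accumulate
print(counts.add(pd.Series([5], index=[3]), fill_value=0))  # index 2 -> 3.0, index 3 -> 9.0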
However, you can approach this problem another way: since you have the counts for each chunk of your dataset, you can simply store them in a list, concatenate all of them into a larger Series of counts, then groupby again and sum the counts.
pd.concat([
    log.assign(module_id=log['description'].apply(getModuleIdFromDescription))
       .groupby(['user_id', 'module_id'])
       .size().rename('count')
    for log in logs
]).reset_index().groupby(['user_id', 'module_id'])['count'].sum()
Note that in this way you won't see a user_id that doesn't have any counts in any of the modules.
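If you do need those users, one option (a sketch, assuming df is indexed by user_id as in the question and the result above is stored in a variable counts) is to look them up afterwards:

# Users from df that never appear in the aggregated counts
missing_users = df.index.difference(counts.index.get_level_values("user_id"))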