I have a small problem when trying to merge pandas Series together; thanks in advance for your help.
I have a huge file containing 30M rows. I can't fit all the data in memory, so I used the chunksize option of the pandas.read_csv method.
Here is an example of the file I have to parse. It represents accesses by users to certain modules inside the app. I replaced the time with x since it is not important in this case. We have data about each user (user_id), the time, and the module (module_id) they accessed.
user_id;time;module_id
1;x;101
1;x;101
1;x;102
1;x;102
2;x;102
2;x;102
2;x;103
My task is to go through the whole document and find out how many times each user has accessed each module. I also have another dataframe (df) with information about each user. The index of df is user_id.
Here is the code I have so far:
logs = pd.read_csv(..., chunksize=50000, index_col='user_id')
module_accesses = pd.Series(index=df.index, dtype='object', name='module_id')

for log in logs:
    module_accesses = process(log, module_accesses)

def process(log, module_accesses):
    # Edit: the line below was added
    log['module_id'] = log['description'].apply(getModuleIdFromDescription)
    groups = log.groupby(['user_id', 'module_id']).size()
    new_groups = module_accesses + groups
    return new_groups
When I initialize module_accesses, it looks like this:
user_id
440250 NaN
428394 NaN
...
746792 NaN
746794 NaN
The variable groups looks like this:
user_id module_id
592844 272 2
1711 2
6707 1
594073 326 1
3094 1
..
649479 288 9
2302 9
2312 1
651921 251 2
948 1
However, when I do new_groups = module_accesses + groups, the variable new_groups has this value:
592844 272 NaN
1711 NaN
6707 NaN
594073 326 NaN
3094 NaN
...
649479 288 NaN
2302 NaN
2312 NaN
651921 251 NaN
948 NaN
I want to be able to go through each chunk and update the Series module_accesses. The count of accesses for each user_id / module_id pair should increase, and at the end I should have the total count of accesses for each user and each module. The expected result would be like the variable groups, but for the whole file. We have a lot of users (350k) and modules (100k). Also, most users only access 10-20 modules. Example of expected result:
user_id module_id
592844 272 27389
1711 103827
6707 194625
594073 326 2846
3094 103826
..
649479 288 1023
2302 2038
2312 12344
651921 251 829
948 302
CodePudding user response:
I think that you'll need module_accesses to be a pd.DataFrame, because you need to keep track of the "number of accesses to a module per user":
module_accesses = pd.DataFrame([], columns=["user_id", "module_id", "n"])
That said, I would iteratively merge module_accesses (current known accesses) and groups (new accesses to register), and then combine the two count columns into a single one:
# Slight modification to name the size column "n"
groups = log.groupby(["user_id", "module_id"]).size().to_frame("n")
# Merge known and new accesses
new_groups = pd.merge(module_accesses, groups, how="outer", on=["user_id", "module_id"])
# Combine both counts into a single count
new_groups["n"] = new_groups.pop("n_x").fillna(0) new_groups.pop("n_y").fillna(0)
Resulting definition for process:
def process(log, module_accesses):
    groups = log.groupby(["user_id", "module_id"]).size().to_frame("n")
    new_groups = pd.merge(module_accesses, groups, how="outer", on=["user_id", "module_id"])
    new_groups["n"] = new_groups.pop("n_x").fillna(0) + new_groups.pop("n_y").fillna(0)
    return new_groups
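For completeness, here is a minimal sketch of how that process could be driven from the chunked reader in the question. This is only a sketch: it assumes logs is the reader from the question and that module_id still has to be derived from the description column, as in the question's edit.

module_accesses = pd.DataFrame([], columns=["user_id", "module_id", "n"])

for log in logs:
    # Derive module_id before counting, as in the question's edit
    log["module_id"] = log["description"].apply(getModuleIdFromDescription)
    module_accesses = process(log, module_accesses)

# Optional: index by (user_id, module_id) to match the expected output shape
result = module_accesses.set_index(["user_id", "module_id"])["n"]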
Another idea (a sketch follows after this list):
- Use process to return groups (without calculating new_groups)
- Accumulate the outputs in a list of dataframes
- Use pd.concat to create a single big dataframe
- Use groupby to reduce the big dataframe into your expected format
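A rough sketch of that idea, assuming logs is the chunked reader and getModuleIdFromDescription is the helper from the question (not a definitive implementation):

partial_counts = []

for log in logs:
    # Per-chunk counts, keyed by (user_id, module_id)
    log["module_id"] = log["description"].apply(getModuleIdFromDescription)
    partial_counts.append(log.groupby(["user_id", "module_id"]).size())

# One big Series of per-chunk counts, then reduce duplicate keys by summing
counts = pd.concat(partial_counts).groupby(level=["user_id", "module_id"]).sum()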
CodePudding user response:
Problems:
- The sum operation of two pd.Series is conducted index-by-index, and apparently your module_accesses and groups do not share the same set of indices, therefore it doesn't sum as you expected.
- Even if you make sure they have the same set of indices, summing a number with np.nan will give you np.nan. So you probably want to initialize your module_accesses to zeros instead of NaNs (a small demonstration follows below).
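A tiny illustration of both points, with made-up toy values:

import numpy as np
import pandas as pd

nan_init = pd.Series([np.nan, np.nan], index=[1, 2])  # like the NaN-initialized module_accesses
counts = pd.Series([3, 4], index=[2, 3])              # like one chunk's counts

print(nan_init + counts)  # all NaN: indices 1 and 3 exist on only one side, and NaN + 3 is still NaN

# Series.add with fill_value=0 treats the missing/NaN side as 0, so the counts accumulate
print(counts.add(pd.Series([5], index=[3]), fill_value=0))  # index 2 -> 3.0, index 3 -> 9.0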
However, you can approach this problem another way: since you have the counts for each chunk of your dataset, you can simply store them in a list, concatenate all of them into a larger Series of counts, then groupby again and sum the counts.
pd.concat([
    log.assign(module_id=log['description'].apply(getModuleIdFromDescription))
       .groupby(['user_id', 'module_id'])
       .size().rename('count')
    for log in logs
]).reset_index().groupby(['user_id', 'module_id'])['count'].sum()
Note that in this way you won't see a user_id that doesn't have any counts in any of the modules.
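If you do need those users, one option (a sketch, assuming df is indexed by user_id as in the question and the result above is stored in a variable counts) is to look them up afterwards:

# Users from df that never appear in the aggregated counts
missing_users = df.index.difference(counts.index.get_level_values("user_id"))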