I have a dataframe:
  lft rel rgt  num
0  t3  r3  z2    3
1  t1  r3  x1    9
2  x2  r3  t2    8
3  x4  r1  t2    4
4  t1  r1  z3    1
5  x1  r1  t2    2
6  x2  r2  t4    4
7  z3  r2  t4    5
8  t4  r3  x3    4
9  z1  r2  t3    4
And a reference dictionary:
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3']
}
My goal is to replace all occurrences of the values in replacement_dict['X1'] with 'X1', and then merge the rows together. For example, any instance of 'x1', 'x2', 'x3' or 'x4' will be replaced by 'X1', and likewise for the other keys.
I can do this by selecting the rows that contain any of these strings and replacing them with 'X1':
keys = replacement_dict.keys()
for key in keys:
    DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
    DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key
giving:
  lft rel rgt  num
0  t3  r3  Z1    3
1  t1  r3  X1    9
2  X1  r3  t2    8
3  X1  r1  t2    4
4  t1  r1  Z1    1
5  X1  r1  t2    2
6  X1  r2  t4    4
7  Z1  r2  t4    5
8  t4  r3  X1    4
9  Z1  r2  t3    4
Now, if I select all the rows containing 'X1' and merge them, I should end up with:
  lft rel rgt  num
0  X1  r3  t2    8
1  X1  r1  t2    6
2  X1  r2  t4    4
3  t1  r3  X1    9
4  t4  r3  X1    4
So the three columns ['lft', 'rel', 'rgt'] are unique, while the 'num' column is summed for each group of rows. Row 1 above, ['X1', 'r1', 't2', 6], is the sum of the two rows ['X1', 'r1', 't2', 4] and ['X1', 'r1', 't2', 2].
I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys. This is taking forever using a simple row-wise extraction and replacement.
How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?
CodePudding user response:
You can reverse the replacement_dict mapping and sum the num values after grouping by the lft, rel and rgt columns.
# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
# change values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# change values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])
# sum values in num column by groups
df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
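Putting those steps together on the question's sample data, a minimal, self-contained sketch:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'lft': ['t3', 't1', 'x2', 'x4', 't1', 'x1', 'x2', 'z3', 't4', 'z1'],
    'rel': ['r3', 'r3', 'r3', 'r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r2'],
    'rgt': ['z2', 'x1', 't2', 't2', 'z3', 't2', 't4', 't4', 'x3', 't3'],
    'num': [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3'],
}

# reverse the mapping: each list element points back to its key
reverse_map = {v: k for k, li in replacement_dict.items() for v in li}

# map() does one dict lookup per element; fillna keeps unmapped values
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])

# sum num within each unique (lft, rel, rgt) combination
out = df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
print(out)
```

Because map() with a plain dict avoids the per-key loop entirely, this scales with the number of rows rather than the 60,000 dictionary keys.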
CodePudding user response:
Pandas has a built-in function, replace, that is faster than walking the whole dataframe with .loc. It also accepts a list of values to replace, which makes our dictionary a good fit for it:
keys = replacement_dict.keys()
# Loop through every key in the dictionary and apply its replacements
for key in keys:
    DF = DF.replace(to_replace=replacement_dict[key], value=key)
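This loop can be checked end-to-end on the question's sample data; the final groupby step (from the question's stated goal, not part of this answer's snippet) is appended so the result is comparable:

```python
import pandas as pd

# sample data from the question
DF = pd.DataFrame({
    'lft': ['t3', 't1', 'x2', 'x4', 't1', 'x1', 'x2', 'z3', 't4', 'z1'],
    'rel': ['r3', 'r3', 'r3', 'r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r2'],
    'rgt': ['z2', 'x1', 't2', 't2', 'z3', 't2', 't4', 't4', 'x3', 't3'],
    'num': [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3'],
}

# replace() takes a list of old values and a single scalar replacement
for key in replacement_dict.keys():
    DF = DF.replace(to_replace=replacement_dict[key], value=key)

# merge duplicate (lft, rel, rgt) rows by summing num
out = DF.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
```

Note that this still loops once per dictionary key, so with 60,000 keys the single-pass map()-based answers should be faster.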
CodePudding user response:
If you flip the keys and values of your replacement_dict, things become a lot easier:
new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}
cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
CodePudding user response:
Try this; the steps are commented:
#reverse dict to dissolve the lists as values
reversed_dict = {v:k for k,val in replacement_dict.items() for v in val}
# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)
# filter rows where 'X1' appears in any of the columns
df_filtered = df[df.eq('X1').any(axis=1)]
# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)
Output:
  lft rel rgt  num
0  X1  r1  t2    6
1  X1  r2  t4    4
2  X1  r3  t2    8
3  t1  r3  X1    9
4  t4  r3  X1    4
CodePudding user response:
Here's a way to do what your question asks:
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df[(df.lft == 'X1') | (df.rgt == 'X1')]
.groupby(['lft','rel','rgt']).sum().reset_index() )
Output:
  lft rel rgt  num
0  X1  r1  t2    6
1  X1  r2  t4    4
2  X1  r3  t2    8
3  t1  r3  X1    9
4  t4  r3  X1    4
Explanation:
- replace() uses a reversed version of the dictionary to replace items from the lists in the original dict with the corresponding keys in the relevant df columns lft and rgt.
- After filtering for rows with 'X1' found in either lft or rgt, use groupby(), sum() and reset_index() to sum the num column for unique lft, rel, rgt group keys and restore the group components from index levels to columns.
As an alternative, we can use query() to select only the rows containing 'X1':
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df.query("lft=='X1' or rgt=='X1'")
.groupby(['lft','rel','rgt']).sum().reset_index() )
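The query() variant can be run end-to-end on the question's sample data; a minimal sketch:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'lft': ['t3', 't1', 'x2', 'x4', 't1', 'x1', 'x2', 'z3', 't4', 'z1'],
    'rel': ['r3', 'r3', 'r3', 'r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r2'],
    'rgt': ['z2', 'x1', 't2', 't2', 'z3', 't2', 't4', 't4', 'x3', 't3'],
    'num': [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3'],
}

# reverse the dict and replace only in the lft and rgt columns
df[['lft', 'rgt']] = df[['lft', 'rgt']].replace(
    {it: k for k, v in replacement_dict.items() for it in v})

# query() evaluates the boolean expression against column names
out = (df.query("lft == 'X1' or rgt == 'X1'")
         .groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum())
```

This reproduces the five-row result shown in the answer, with only the 'X1' rows retained before aggregation.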