I have a dataframe:
  lft rel rgt  num
0  t3  r3  z2    3
1  t1  r3  x1    9
2  x2  r3  t2    8
3  x4  r1  t2    4
4  t1  r1  z3    1
5  x1  r1  t2    2
6  x2  r2  t4    4
7  z3  r2  t4    5
8  t4  r3  x3    4
9  z1  r2  t3    4
And a reference dictionary:
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3']
}
My goal is to replace all occurrences of the values in replacement_dict['X1'] with 'X1', and then merge the rows together. For example, any instance of 'x1', 'x2', 'x3' or 'x4' will be replaced by 'X1', and likewise for the other keys.
I can do this by selecting the rows that contain any of these strings and replacing them with 'X1':
keys = replacement_dict.keys()
for key in keys:
    DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
    DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key
giving:
  lft rel rgt  num
0  t3  r3  Z1    3
1  t1  r3  X1    9
2  X1  r3  t2    8
3  X1  r1  t2    4
4  t1  r1  Z1    1
5  X1  r1  t2    2
6  X1  r2  t4    4
7  Z1  r2  t4    5
8  t4  r3  X1    4
9  Z1  r2  t3    4
Now, if I select all the rows containing 'X1' and merge them, I should end up with:
  lft rel rgt  num
0  X1  r3  t2    8
1  X1  r1  t2    6
2  X1  r2  t4    4
3  t1  r3  X1    9
4  t4  r3  X1    4
So the three columns ['lft', 'rel', 'rgt'] are unique, while the 'num' column is summed for each group of rows. Row 1 above, ['X1', 'r1', 't2', 6], is the sum of the two rows ['X1', 'r1', 't2', 4] and ['X1', 'r1', 't2', 2].
I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys. This is taking forever using a simple row-wise extraction and replacement.
How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?
CodePudding user response:
You can reverse the replacement_dict mapping and sum the num values after grouping by the lft, rel and rgt columns.
# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
# change values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# change values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])
# sum values in num column by groups
df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
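Putting those steps together on the question's sample data, a minimal, self-contained sketch:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'lft': ['t3', 't1', 'x2', 'x4', 't1', 'x1', 'x2', 'z3', 't4', 'z1'],
    'rel': ['r3', 'r3', 'r3', 'r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r2'],
    'rgt': ['z2', 'x1', 't2', 't2', 'z3', 't2', 't4', 't4', 'x3', 't3'],
    'num': [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3'],
}

# reverse the mapping: each list element points back to its key
reverse_map = {v: k for k, li in replacement_dict.items() for v in li}

# map() does one dict lookup per element; fillna keeps unmapped values
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])

# sum num within each unique (lft, rel, rgt) combination
out = df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
print(out)
```

Because map() with a plain dict avoids the per-key loop entirely, this scales with the number of rows rather than the 60,000 dictionary keys.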
CodePudding user response:
Pandas has a built-in function, replace, that is faster than walking the whole dataframe with .loc. It also accepts a list of values to replace, which makes our dictionary a good fit for it:
keys = replacement_dict.keys()
# Loop through every key in the dictionary and apply its replacements
for key in keys:
    DF = DF.replace(to_replace=replacement_dict[key], value=key)
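This loop can be checked end-to-end on the question's sample data; the final groupby step (from the question's stated goal, not part of this answer's snippet) is appended so the result is comparable:

```python
import pandas as pd

# sample data from the question
DF = pd.DataFrame({
    'lft': ['t3', 't1', 'x2', 'x4', 't1', 'x1', 'x2', 'z3', 't4', 'z1'],
    'rel': ['r3', 'r3', 'r3', 'r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r2'],
    'rgt': ['z2', 'x1', 't2', 't2', 'z3', 't2', 't4', 't4', 'x3', 't3'],
    'num': [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3'],
}

# replace() takes a list of old values and a single scalar replacement
for key in replacement_dict.keys():
    DF = DF.replace(to_replace=replacement_dict[key], value=key)

# merge duplicate (lft, rel, rgt) rows by summing num
out = DF.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
```

Note that this still loops once per dictionary key, so with 60,000 keys the single-pass map()-based answers should be faster.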
CodePudding user response:
If you flip the keys and values of your replacement_dict, things become a lot easier:
new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}
cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
CodePudding user response:
Try this; the steps are commented:
#reverse dict to dissolve the lists as values
reversed_dict = {v:k for k,val in replacement_dict.items() for v in val}
# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)
# filter rows where 'X1' appears in any of the columns
df_filtered = df[df.eq('X1').any(axis=1)]
# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)
Output:
  lft rel rgt  num
0  X1  r1  t2    6
1  X1  r2  t4    4
2  X1  r3  t2    8
3  t1  r3  X1    9
4  t4  r3  X1    4
CodePudding user response:
Here's a way to do what your question asks:
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df[(df.lft == 'X1') | (df.rgt == 'X1')]
.groupby(['lft','rel','rgt']).sum().reset_index() )
Output:
  lft rel rgt  num
0  X1  r1  t2    6
1  X1  r2  t4    4
2  X1  r3  t2    8
3  t1  r3  X1    9
4  t4  r3  X1    4
Explanation:
- replace() uses a reversed version of the dictionary to replace items from the lists in the original dict with the corresponding keys in the relevant df columns lft and rgt.
- After filtering for rows with 'X1' found in either lft or rgt, use groupby(), sum() and reset_index() to sum the num column for unique lft, rel, rgt group keys and restore the group components from index levels to columns.
As an alternative, we can use query() to select only the rows containing 'X1':
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df.query("lft=='X1' or rgt=='X1'")
.groupby(['lft','rel','rgt']).sum().reset_index() )
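The query() variant can be run end-to-end on the question's sample data; a minimal sketch:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'lft': ['t3', 't1', 'x2', 'x4', 't1', 'x1', 'x2', 'z3', 't4', 'z1'],
    'rel': ['r3', 'r3', 'r3', 'r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r2'],
    'rgt': ['z2', 'x1', 't2', 't2', 'z3', 't2', 't4', 't4', 'x3', 't3'],
    'num': [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3'],
}

# reverse the dict and replace only in the lft and rgt columns
df[['lft', 'rgt']] = df[['lft', 'rgt']].replace(
    {it: k for k, v in replacement_dict.items() for it in v})

# query() evaluates the boolean expression against column names
out = (df.query("lft == 'X1' or rgt == 'X1'")
         .groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum())
```

This reproduces the five-row result shown in the answer, with only the 'X1' rows retained before aggregation.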