How to shift values in a data frame according the a list?-CodePudding

Consider the following data frame:

import pandas as pd
import random

characteristics = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
N = int(1e5)
random_characteristics = [random.choice(characteristics) for i in range(N)]

df = pd.DataFrame(data={'character': random_characteristics})

df

--------------------------------------------------------------------
       character
0          A
1          G
2          E
3          G
4          D
...        ...
99995      E
99996      G
99997      A
99998      D
99999      B

100000 rows × 1 columns
--------------------------------------------------------------------

Now, my goal is to create a new column characteristics_shifted that is shifted according to the list characteristics, where the number of shifts can be specified by the user. For instance, if you specify shift = 1, then each character is shifted by one. If the character equals H, then it cannot be shifted by one and therefore remains the same. If shift = 2 and the character equals B, then I want to get D. If, in turn, the character equals G, I want to get H. The same holds for negative shifts, but in the other direction.

Example:

character  characteristics_shifted (shift=1)  characteristics_shifted (shift=-2)  

A          B                                  A
G          H                                  F
E          F                                  D
G          H                                  F
D          E                                  C
H          H                                  F
H          H                                  F
A          B                                  A
E          F                                  D
C          D                                  B
F          G                                  E

Note: My data frame contains around 21 mio. rows. It does not contain nan values.

CodePudding user response：

You can craft a Series and use map on the shifted Series:

c = pd.Series(characteristics, index=characteristics)

shifts = [1, -2]

for s in shifts:
    df[f'shift={s}'] = df['character'].map(c.shift(-s).ffill().bfill())
    
print(df)

output:

      character shift=1 shift=-2
0             G       H        E
1             G       H        E
2             A       B        A
3             E       F        C
4             H       H        F
...         ...     ...      ...
99995         G       H        E
99996         E       F        C
99997         E       F        C
99998         C       D        A
99999         D       E        B

[100000 rows x 3 columns]

optimization

if there are many columns, this variant will be faster (thanks @MichaelSzczesny for pointing this out!):

cat = pd.Series(pd.Categorical(df['character']))
c = pd.Series(pd.Categorical(characteristics), index=characteristics)

shifts = [1, -2]

for s in shifts:
    df[f'shift={s}'] = cat.map(c.shift(-s).ffill().bfill())

CodePudding user response：

You could write a helper function to shift the characters, using min and max to prevent the index going past the limits of the characteristics array. You can then apply that to the character column:

def shift(char, shift):
    idx = max(min(characteristics.index(char)   shift, len(characteristics)-1), 0)
    return characteristics[idx]

df = pd.DataFrame(data={'character': characteristics})
shifts = [-2, -1, 0, 1, 2]
for s in shifts:
    df[f'shift={s}'] = df['character'].apply(lambda c:shift(c, s))

Output:

  character shift=-2 shift=-1 shift=0 shift=1 shift=2
0         A        A        A       A       B       C
1         B        A        A       B       C       D
2         C        A        B       C       D       E
3         D        B        C       D       E       F
4         E        C        D       E       F       G
5         F        D        E       F       G       H
6         G        E        F       G       H       H
7         H        F        G       H       H       H

CodePudding user response：

Although @mozway's approach is very simple and efficient, I would like to add my solution as well. I think my solution is better in terms of performance, but does include more lines of code.


def _get_index_positions(vector_to_shift: np.ndarray, arr_characteristics: np.ndarray) -> np.ndarray:
    """Get index position of matches."""
    # Get sorted index of characteristics
    sort_idx = arr_characteristics.argsort()
    # Find index position of matches
    return sort_idx[np.searchsorted(arr_characteristics, vector_to_shift, sorter=sort_idx)]

def shift(target: np.ndarray, shift: int, index_matches: np.ndarray):
    """Shifts and clips index."""
    shifted_index = np.clip(index_matches   shift, 0, len(target) - 1)
    return target[shifted_index]

index = _get_index_positions(df.character.values, arr_characteristics)

# Define characteristics as array
arr_characteristics = np.array(characteristics)

shifts = [1, 2, 3, 4, -1, -2, -3, -4]
for s in shifts:
    df[f'shift={s}'] = shift(arr_characteristics, s, index)

I would mainly rely on numpy and use indexing to achieve that. My approach takes around 10 seconds, while @mozway's approach takes around 14 seconds for a 21 Mio row data frame.

Edit

The optimized version of @mozway's approach only takes around 3.5 seconds on my machine for 21 mio rows. Now, his approach is definitely better in every dimension.

CodePudding user response：

import pandas as pd
import random

characteristics = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
N = int(1e5)
random_characteristics = [random.choice(characteristics) for i in range(N)]

df = pd.DataFrame(data={'character': random_characteristics})

df['shifted'] = [str(chr(ord(i) 1)) for i in df['character']] # 1 shift 1 position
print(df)