Consider the following data frame:
import pandas as pd
import random
characteristics = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
N = int(1e5)
random_characteristics = [random.choice(characteristics) for i in range(N)]
df = pd.DataFrame(data={'character': random_characteristics})
df
--------------------------------------------------------------------
character
0 A
1 G
2 E
3 G
4 D
... ...
99995 E
99996 G
99997 A
99998 D
99999 B
100000 rows × 1 columns
--------------------------------------------------------------------
Now, my goal is to create a new column characteristics_shifted
that is shifted according to the list characteristics
, where the number of shifts can be specified by the user.
For instance, if you specify shift = 1
, then each character is shifted by one. If the character equals H
, then it cannot be shifted by one and therefore remains the same. If shift = 2
and the character equals B
, then I want to get D
. If, in turn, the character equals G
, I want to get H
. The same holds for negative shifts, but in the other direction.
Example:
character characteristics_shifted (shift=1) characteristics_shifted (shift=-2)
A B A
G H F
E F D
G H F
D E C
H H F
H H F
A B A
E F D
C D B
F G E
Note: My data frame contains around 21 mio. rows. It does not contain nan values.
CodePudding user response:
You can craft a Series and use map
on the shifted Series:
c = pd.Series(characteristics, index=characteristics)
shifts = [1, -2]
for s in shifts:
df[f'shift={s}'] = df['character'].map(c.shift(-s).ffill().bfill())
print(df)
output:
character shift=1 shift=-2
0 G H E
1 G H E
2 A B A
3 E F C
4 H H F
... ... ... ...
99995 G H E
99996 E F C
99997 E F C
99998 C D A
99999 D E B
[100000 rows x 3 columns]
optimization
if there are many columns, this variant will be faster (thanks @MichaelSzczesny for pointing this out!):
cat = pd.Series(pd.Categorical(df['character']))
c = pd.Series(pd.Categorical(characteristics), index=characteristics)
shifts = [1, -2]
for s in shifts:
df[f'shift={s}'] = cat.map(c.shift(-s).ffill().bfill())
CodePudding user response:
You could write a helper function to shift the characters, using min
and max
to prevent the index going past the limits of the characteristics
array. You can then apply
that to the character
column:
def shift(char, shift):
idx = max(min(characteristics.index(char) shift, len(characteristics)-1), 0)
return characteristics[idx]
df = pd.DataFrame(data={'character': characteristics})
shifts = [-2, -1, 0, 1, 2]
for s in shifts:
df[f'shift={s}'] = df['character'].apply(lambda c:shift(c, s))
Output:
character shift=-2 shift=-1 shift=0 shift=1 shift=2
0 A A A A B C
1 B A A B C D
2 C A B C D E
3 D B C D E F
4 E C D E F G
5 F D E F G H
6 G E F G H H
7 H F G H H H
CodePudding user response:
Although @mozway's approach is very simple and efficient, I would like to add my solution as well. I think my solution is better in terms of performance, but does include more lines of code.
def _get_index_positions(vector_to_shift: np.ndarray, arr_characteristics: np.ndarray) -> np.ndarray:
"""Get index position of matches."""
# Get sorted index of characteristics
sort_idx = arr_characteristics.argsort()
# Find index position of matches
return sort_idx[np.searchsorted(arr_characteristics, vector_to_shift, sorter=sort_idx)]
def shift(target: np.ndarray, shift: int, index_matches: np.ndarray):
"""Shifts and clips index."""
shifted_index = np.clip(index_matches shift, 0, len(target) - 1)
return target[shifted_index]
index = _get_index_positions(df.character.values, arr_characteristics)
# Define characteristics as array
arr_characteristics = np.array(characteristics)
shifts = [1, 2, 3, 4, -1, -2, -3, -4]
for s in shifts:
df[f'shift={s}'] = shift(arr_characteristics, s, index)
I would mainly rely on numpy
and use indexing to achieve that. My approach takes around 10 seconds
, while @mozway's approach takes around 14
seconds for a 21 Mio row data frame.
Edit
The optimized version of @mozway's approach only takes around 3.5 seconds
on my machine for 21 mio rows. Now, his approach is definitely better in every dimension.
CodePudding user response:
import pandas as pd
import random
characteristics = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
N = int(1e5)
random_characteristics = [random.choice(characteristics) for i in range(N)]
df = pd.DataFrame(data={'character': random_characteristics})
df['shifted'] = [str(chr(ord(i) 1)) for i in df['character']] # 1 shift 1 position
print(df)