I have the following dataframe with two columns c1
and c2
, I want to add a new column c3
based on the following logic, what I have works but is slow, can anyone suggest a way to vectorize this?
- Must be grouped based on
c1
andc2
, then for each group, the new columnc3
must be populated sequentially fromvalues
where the key is the value ofc1
and each "sub group" will have subsequent values, IOWvalues[value_of_c1][idx]
, whereidx
is the "sub group", example below - The first group
(1, 'a')
, herec1
is1
, the "sub group""a"
index is0
(first sub group of 1) soc3
for all rows in this group isvalues[1][0]
- The second group
(1, 'b')
herec1
is still1
but "sub group" is"b"
so index1
(second sub group of 1) so for all rows in this groupc3
isvalues[1][1]
- The third group
(2, 'y')
herec1
is now2
, "sub group" is"a"
and the index is0
(first sub group of 2), so for all rows in this groupc3
isvalues[2][0]
- And so on
values
will have the necessary elements to satisfy this logic.
Code
import pandas as pd
df = pd.DataFrame(
{
"c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
"c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
}
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
for idx, (k, l) in enumerate(j.groupby("c2")):
l["c3"] = values[i][idx]
new_df = new_df.append(l)
Output (works but my code is slow)
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
CodePudding user response:
If you don't mind using another library, you basically need to label encode within your groups:
from sklearn.preprocessing import LabelEncoder
def le(x):
return pd.DataFrame(LabelEncoder().fit_transform(x),index=x.index)
df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x:values[x['c1']][x['idx']],axis=1)
c1 c2 idx c3
0 1 a 0 a1
1 1 a 0 a1
2 1 a 0 a1
3 1 b 1 a2
4 1 b 1 a2
5 1 b 1 a2
6 2 y 0 b1
7 2 y 0 b1
8 2 y 0 b1
9 2 z 1 b2
10 2 z 1 b2
11 2 z 1 b2
Otherwise it's a matter of using pd.Categorical
, same concept as above, just that you convert within each group, the column into a category and just pull out the code:
def le(x):
return pd.DataFrame(pd.Categorical(x).codes,index=x.index)
CodePudding user response:
In [203]: a = pd.DataFrame([[k, value, idx] for k,v in values.items() for idx,value in enumerate(v)], columns=['c1', 'c3', 'gr'])
...: b = df.assign(gr=df.groupby(['c1']).transform(lambda x: (x.ne(x.shift()).cumsum())- 1))
...: print(b)
...: b.merge(a).drop(columns='gr')
...:
# b
c1 c2 gr
0 1 a 0
1 1 a 0
2 1 a 0
3 1 b 1
4 1 b 1
5 1 b 1
6 2 y 0
7 2 y 0
8 2 y 0
9 2 z 1
10 2 z 1
11 2 z 1
Out[203]:
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2