I'm wondering if there's a nicer way to address the following problem
I have a dataframe with the following example structure:
Split_key | label | sub_label |
---|---|---|
A_B_C | 7 | "" |
A_B_C | 7 | "" |
A_B_C | 8 | "" |
A_B_C | 8 | "" |
A_B_C | 10 | "" |
A_B_C | 10 | "" |
D_E_F | 2 | "" |
D_E_F | 7 | "" |
D_E_F | 15 | "" |
G_H_I | 1 | "" |
G_H_I | 2 | "" |
G_H_I | 3 | "" |
I wish to populate sub_label with a value that corresponds to splitting the value in Split_key on the "_" character and grabs the correct element based on label. The correct element is the index of the value in label in the unique sorted array of labels that share the same value in Split_key.
The correct end result is shown here.
Split_key | label | sub_label |
---|---|---|
A_B_C | 7 | A |
A_B_C | 7 | A |
A_B_C | 8 | B |
A_B_C | 8 | B |
A_B_C | 10 | C |
A_B_C | 10 | C |
D_E_F | 2 | D |
D_E_F | 7 | E |
D_E_F | 15 | F |
G_H_I | 1 | G |
G_H_I | 2 | H |
G_H_I | 3 | I |
My initial attempt is very slow on large dataframes:
for i,row in bigframe.iterrows():
duplicates=bigframe[ bigframe["Split_key"]==row["Split_key"]]
if len(row["Split_key"].split("_"))<1:
continue
if len(duplicates)==1:
row["sub_label"]=row["Split_key"].split("_")[0]
else:
try:
shift=sorted(duplicates["label"].unique().astype(int)).index(int(row["label"]))
except:
shift=0
if (shift<len(row["Split_key"].split("_"))):
row["sub_label"]=row["Split_key"].split("_")[shift]
Is there any way to vectorize this code in python/pandas? I know using group / ungroup in R makes this possible from a previous post.
CodePudding user response:
Let us try groupby
with factorize
then map
df['new'] = df.groupby('Split_key',as_index=False).apply(lambda x : pd.Series(x['label'].factorize()[0]).map(dict(enumerate(x['Split_key'].iloc[0].split('_'))))).values
df
Out[869]:
Split_key label sub_label new
0 A_B_C 7 NaN A
1 A_B_C 7 NaN A
2 A_B_C 8 NaN B
3 A_B_C 8 NaN B
4 A_B_C 10 NaN C
5 A_B_C 10 NaN C
6 D_E_F 2 NaN D
7 D_E_F 7 NaN E
8 D_E_F 15 NaN F
9 G_H_I 1 NaN G
10 G_H_I 2 NaN H
11 G_H_I 3 NaN I
CodePudding user response:
I believe it would be clearer if you run it as a for loop:
grouped = df.groupby('Split_key', sort = False)
collection = []
for key, frame in grouped:
out, _ = frame.label.factorize()
out = [key.split('_')[num] for num in out]
frame['sub_label'] = out
collection.append(frame)
pd.concat(collection)
Split_key label sub_label
0 A_B_C 7 A
1 A_B_C 7 A
2 A_B_C 8 B
3 A_B_C 8 B
4 A_B_C 10 C
5 A_B_C 10 C
6 D_E_F 2 D
7 D_E_F 7 E
8 D_E_F 15 F
9 G_H_I 1 G
10 G_H_I 2 H
11 G_H_I 3 I