Processing data in one column using split and index based on value in other column pandas-CodePudding

I'm wondering if there's a nicer way to address the following problem

I have a dataframe with the following example structure:

Split_key	label	sub_label
A_B_C	7	""
A_B_C	7	""
A_B_C	8	""
A_B_C	8	""
A_B_C	10	""
A_B_C	10	""
D_E_F	2	""
D_E_F	7	""
D_E_F	15	""
G_H_I	1	""
G_H_I	2	""
G_H_I	3	""

I wish to populate sub_label with a value that corresponds to splitting the value in Split_key on the "_" character and grabs the correct element based on label. The correct element is the index of the value in label in the unique sorted array of labels that share the same value in Split_key.

The correct end result is shown here.

Split_key	label	sub_label
A_B_C	7	A
A_B_C	7	A
A_B_C	8	B
A_B_C	8	B
A_B_C	10	C
A_B_C	10	C
D_E_F	2	D
D_E_F	7	E
D_E_F	15	F
G_H_I	1	G
G_H_I	2	H
G_H_I	3	I

My initial attempt is very slow on large dataframes:

for i,row in bigframe.iterrows():
    duplicates=bigframe[ bigframe["Split_key"]==row["Split_key"]]
    if len(row["Split_key"].split("_"))<1:
        continue
    if len(duplicates)==1:
            row["sub_label"]=row["Split_key"].split("_")[0]
    else:
        try:
              shift=sorted(duplicates["label"].unique().astype(int)).index(int(row["label"]))
        except:
            shift=0
        if (shift<len(row["Split_key"].split("_"))):
            row["sub_label"]=row["Split_key"].split("_")[shift]

Is there any way to vectorize this code in python/pandas? I know using group / ungroup in R makes this possible from a previous post.

CodePudding user response：

Let us try groupby with factorize then map

df['new'] = df.groupby('Split_key',as_index=False).apply(lambda x :  pd.Series(x['label'].factorize()[0]).map(dict(enumerate(x['Split_key'].iloc[0].split('_'))))).values
df
Out[869]: 
   Split_key  label  sub_label new
0      A_B_C      7        NaN   A
1      A_B_C      7        NaN   A
2      A_B_C      8        NaN   B
3      A_B_C      8        NaN   B
4      A_B_C     10        NaN   C
5      A_B_C     10        NaN   C
6      D_E_F      2        NaN   D
7      D_E_F      7        NaN   E
8      D_E_F     15        NaN   F
9      G_H_I      1        NaN   G
10     G_H_I      2        NaN   H
11     G_H_I      3        NaN   I

CodePudding user response：

I believe it would be clearer if you run it as a for loop:

grouped = df.groupby('Split_key', sort = False)
collection = []
for key, frame in grouped:
    out, _ = frame.label.factorize()
    out = [key.split('_')[num] for num in out]
    frame['sub_label'] = out
    collection.append(frame)

pd.concat(collection)
   Split_key  label sub_label
0      A_B_C      7         A
1      A_B_C      7         A
2      A_B_C      8         B
3      A_B_C      8         B
4      A_B_C     10         C
5      A_B_C     10         C
6      D_E_F      2         D
7      D_E_F      7         E
8      D_E_F     15         F
9      G_H_I      1         G
10     G_H_I      2         H
11     G_H_I      3         I