Reset a group of identifiers in a Pandas dataframe column

I have generated three outputs from a dataframe and I am trying to reset the identifiers of my sentences (Sentence_ID) by starting from 1 for each output.

Example output:

Sentence_ID  Mention Tag
6388    Chailland   B-LOCATION
6388    ,   O
6388    Mayenne B-LOCATION

6389    poste   O
6389    de  O
6389    Goumois B-LOCATION
6389    (   I-LOCATION
6389    Doubs   I-LOCATION
6389    )   I-LOCATION
6389    .   O
        
6390    Pichet  B-PERSON
6390    (   O
6390    veuve   O
6390    )   O
6390    ,   O
6390    de  O
6390    Paris   B-LOCATION
6390    .   O
... continue

Expected output:

Sentence_ID  Mention Tag
1   Chailland   B-LOCATION
1   ,   O
1   Mayenne B-LOCATION

2   poste   O
2   de  O
2   Goumois B-LOCATION
2   (   I-LOCATION
2   Doubs   I-LOCATION
2   )   I-LOCATION
2   .   O
        
3   Pichet  B-PERSON
3   (   O
3   veuve   O
3   )   O
3   ,   O
3   de  O
3   Paris   B-LOCATION
3   .   O
... continue

I must be missing something, but I am not sure whether I should apply a counter on the Sentence_ID column (via groupby()) or use reset_index on this specific column to complete this task.

If anyone has a lead, thanks in advance.

CodePudding user response：

You can use pd.factorize to generate a new set of sequence numbers, as follows:

df['Sentence_ID'] = pd.factorize(df['Sentence_ID'])[0] + 1

Or use Series.factorize:

df['Sentence_ID'] = df['Sentence_ID'].factorize()[0] + 1

Result:

print(df)


    Sentence_ID    Mention         Tag
0             1  Chailland  B-LOCATION
1             1          ,           O
2             1    Mayenne  B-LOCATION
3             2      poste           O
4             2         de           O
5             2    Goumois  B-LOCATION
6             2          (  I-LOCATION
7             2      Doubs  I-LOCATION
8             2          )  I-LOCATION
9             2          .           O
10            3     Pichet    B-PERSON
11            3          (           O
12            3      veuve           O
13            3          )           O
14            3          ,           O
15            3         de           O
16            3      Paris  B-LOCATION
17            3          .           O
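For anyone who wants to try the factorize approach in isolation, here is a minimal sketch. The small dataframe is a shortened stand-in for the data in the question, not the asker's real output. factorize returns zero-based codes in order of first appearance, which is why 1 is added.

```python
import pandas as pd

# Shortened stand-in for the question's data (illustrative only).
df = pd.DataFrame({
    "Sentence_ID": [6388, 6388, 6388, 6389, 6389, 6390],
    "Mention": ["Chailland", ",", "Mayenne", "poste", "de", "Pichet"],
    "Tag": ["B-LOCATION", "O", "B-LOCATION", "O", "O", "B-PERSON"],
})

# codes: zero-based integer label per row, numbered by first appearance.
# uniques: the distinct original IDs, in that same order.
codes, uniques = pd.factorize(df["Sentence_ID"])

# Shift the codes so that numbering starts at 1 instead of 0.
df["Sentence_ID"] = codes + 1

print(df["Sentence_ID"].tolist())  # [1, 1, 1, 2, 2, 3]
```

Note that rows sharing the same original ID always receive the same new ID, even if they are not adjacent in the dataframe.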

CodePudding user response：

You can create a dictionary whose keys are the old IDs and whose values are the new IDs, and use it to map the Sentence_ID column to its new values:

mapping = dict(zip(df["Sentence_ID"].unique(), range(1, df["Sentence_ID"].nunique() + 1)))
df["Sentence_ID"] = df["Sentence_ID"].map(mapping)
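Since the question mentions several outputs that should each restart at 1, the mapping idea could be wrapped in a small helper and applied to each output separately. This is only a sketch: renumber_sentences, output_a and output_b are hypothetical names, and the two small dataframes are made-up stand-ins for the real outputs.

```python
import pandas as pd

def renumber_sentences(frame, column="Sentence_ID"):
    """Return a copy of frame with the IDs in `column` renumbered from 1.

    Hypothetical helper, not part of pandas.
    """
    old_ids = frame[column].unique()  # distinct IDs, in order of first appearance
    mapping = dict(zip(old_ids, range(1, len(old_ids) + 1)))
    result = frame.copy()  # leave the caller's dataframe untouched
    result[column] = result[column].map(mapping)
    return result

# Made-up stand-ins for two of the generated outputs.
output_a = pd.DataFrame({"Sentence_ID": [6388, 6388, 6389],
                         "Mention": ["Chailland", ",", "poste"]})
output_b = pd.DataFrame({"Sentence_ID": [7001, 7002, 7002],
                         "Mention": ["Pichet", "de", "Paris"]})

# Each output gets its own numbering, starting again at 1.
renumbered = [renumber_sentences(out) for out in (output_a, output_b)]
for frame in renumbered:
    print(frame["Sentence_ID"].tolist())
```

Working on a copy keeps the original IDs available in case they are needed later to trace a sentence back to the source dataframe.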