I have generated three outputs from a dataframe and I am trying to reset the sentence identifiers (Sentence_ID) so that they start from 1 in each output.
Example output:
Sentence_ID Mention Tag
6388 Chailland B-LOCATION
6388 , O
6388 Mayenne B-LOCATION
6389 poste O
6389 de O
6389 Goumois B-LOCATION
6389 ( I-LOCATION
6389 Doubs I-LOCATION
6389 ) I-LOCATION
6389 . O
6390 Pichet B-PERSON
6390 ( O
6390 veuve O
6390 ) O
6390 , O
6390 de O
6390 Paris B-LOCATION
6390 . O
... continue
Expected output:
Sentence_ID Mention Tag
1 Chailland B-LOCATION
1 , O
1 Mayenne B-LOCATION
2 poste O
2 de O
2 Goumois B-LOCATION
2 ( I-LOCATION
2 Doubs I-LOCATION
2 ) I-LOCATION
2 . O
3 Pichet B-PERSON
3 ( O
3 veuve O
3 ) O
3 , O
3 de O
3 Paris B-LOCATION
3 . O
... continue
I must be missing something, but I am not sure whether I should apply a counter on the Sentence_ID column (via groupby()) or use reset_index on this specific column to complete this task.
If anyone has a lead, thanks in advance.
CodePudding user response:
You can use pd.factorize to generate a new set of sequence numbers (the codes it returns start at 0, hence the + 1), as follows:
df['Sentence_ID'] = pd.factorize(df['Sentence_ID'])[0] + 1
or use Series.factorize:
df['Sentence_ID'] = df['Sentence_ID'].factorize()[0] + 1
Result:
print(df)
Sentence_ID Mention Tag
0 1 Chailland B-LOCATION
1 1 , O
2 1 Mayenne B-LOCATION
3 2 poste O
4 2 de O
5 2 Goumois B-LOCATION
6 2 ( I-LOCATION
7 2 Doubs I-LOCATION
8 2 ) I-LOCATION
9 2 . O
10 3 Pichet B-PERSON
11 3 ( O
12 3 veuve O
13 3 ) O
14 3 , O
15 3 de O
16 3 Paris B-LOCATION
17 3 . O
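For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The small dataframe is made up to mirror the layout in the question (it is not the asker's real data), and unpacking the result into codes and uniques is just one way to write it.

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout (not the real data).
df = pd.DataFrame({
    "Sentence_ID": [6388, 6388, 6388, 6389, 6389, 6390],
    "Mention": ["Chailland", ",", "Mayenne", "poste", "de", "Pichet"],
    "Tag": ["B-LOCATION", "O", "B-LOCATION", "O", "O", "B-PERSON"],
})

# factorize returns (codes, uniques): codes are 0-based integers assigned
# in order of first appearance, uniques holds the original values.
codes, uniques = pd.factorize(df["Sentence_ID"])

# Shift by 1 so the new identifiers start at 1 instead of 0.
df["Sentence_ID"] = codes + 1

print(df)
```

Because the codes follow order of first appearance, the rows do not need to be sorted by the old ID for this to work.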
CodePudding user response:
You can create a dictionary whose keys are the old IDs and whose values are the new IDs, and use it to map a new Sentence_ID column:
mapping = dict(zip(df["Sentence_ID"].unique(), range(1, df["Sentence_ID"].nunique() + 1)))
df["Sentence_ID"] = df["Sentence_ID"].map(mapping)
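Since the question mentions three separate outputs, the same mapping idea could be wrapped in a small helper and applied to each one. This is only a sketch: the function name renumber_sentences and the sample dataframe are hypothetical, and it builds the dictionary with enumerate rather than zip/range, which should give the same mapping.

```python
import pandas as pd

def renumber_sentences(frame, column="Sentence_ID"):
    """Return a copy of frame with `column` renumbered from 1,
    in order of first appearance. (Hypothetical helper name.)"""
    old_ids = frame[column].unique()  # unique() keeps order of appearance
    mapping = {old: new for new, old in enumerate(old_ids, start=1)}
    out = frame.copy()                # leave the caller's dataframe untouched
    out[column] = out[column].map(mapping)
    return out

# Hypothetical sample; each of the three outputs would be passed the same way.
sample = pd.DataFrame({
    "Sentence_ID": [6390, 6390, 6388, 6389, 6389],
    "Mention": ["Pichet", "(", "Chailland", "poste", "de"],
})
result = renumber_sentences(sample)
print(result)
```

Working on a copy means each output can be renumbered independently without changing the dataframe it was derived from.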