I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large integers (6 digits). I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, and so on.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
CodePudding user response:
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
CodePudding user response:
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize enumerates the keys in the order in which they occur in your data, while the groupby().ngroup() solution enumerates the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, or you can replicate the data order with groupby() by passing sort=False to it.
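The ordering difference does not show up in the question's data because the ids there are already sorted. Here is a minimal sketch of the difference, using a small hypothetical Series (the values and variable names are made up for illustration) whose ids are deliberately out of order. It passes sort=True to factorize as one possible way to get the increasing order, instead of sorting the data beforehand:

```python
import pandas as pd

# Hypothetical ids, deliberately NOT in increasing order
ids = pd.Series([542605, 542588, 542605, 542594])

# factorize: numbers the keys in order of first appearance
by_appearance = ids.factorize()[0] + 10

# groupby().ngroup(): numbers the keys in increasing order (sort=True is the default)
by_value = ids.groupby(ids).ngroup() + 10

# factorize with sort=True: increasing order, like the default ngroup()
by_value_factorize = ids.factorize(sort=True)[0] + 10

# groupby(sort=False).ngroup(): order of first appearance, like the default factorize
by_appearance_groupby = ids.groupby(ids, sort=False).ngroup() + 10

print(list(by_appearance))  # [10, 11, 10, 12]
print(list(by_value))       # [12, 10, 12, 11]
```

If the new ids only need to be distinct, either ordering works. The choice matters only when the new numbers should preserve the ranking of the original ids.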
CodePudding user response:
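# Map each distinct old id to a new id, numbering from 10 in order of first appearance
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1  # advance the counter for the next unseen id
df['id'] = df['id'].map(new_ids)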