I have the following dataframe (sample):
import pandas as pd
min_id = 1
max_id = 10
data = [['A', 2], ['A', 3], ['A', 1], ['A', 4], ['A', 4], ['A', 2],
['B', 4], ['B', 5], ['B', 7], ['B', 4], ['B', 2],
['C', 1], ['C', 3], ['C', 2], ['C', 1], ['C', 5], ['C', 2] ,['C', 1],
['D', 1], ['D', 1], ['D', 1], ['D', 1]]
df = pd.DataFrame(data = data, columns = ['group', 'val'])
group val
0 A 2
1 A 3
2 A 1
3 A 4
4 A 4
5 A 2
6 B 4
7 B 5
8 B 7
9 B 4
10 B 2
11 C 1
12 C 3
13 C 2
14 C 1
15 C 5
16 C 2
17 C 1
18 D 1
19 D 1
20 D 1
21 D 1
I would like to create a column called "id" which shows the id with a min value of 1 (min_id) and a max value of 10 (max_id) per group. So the values between min and max depend on the number of rows per group. Here you can see the desired output:
data = [['A', 2, 1], ['A', 3, 2.8], ['A', 1, 4.6], ['A', 4, 6.4], ['A', 4, 8.2], ['A', 2, 10],
['B', 4, 1], ['B', 5, 3.25], ['B', 7, 5.5], ['B', 4, 7.75], ['B', 2, 10],
['C', 1, 1], ['C', 3, 2.5], ['C', 2, 4], ['C', 1, 5.5], ['C', 5, 7], ['C', 2, 8.5] ,['C', 1, 10],
['D', 1, 1], ['D', 1, 4], ['D', 1, 7], ['D', 1, 10]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'val', 'id'])
group val id
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
So I was wondering if anyone knows how to automatically create the column "id" using pandas
? Please note that the number of rows could be way more then in the sample dataframe.
CodePudding user response:
One solution could be as follows:
import pandas as pd
import numpy as np
df['id'] = df.groupby('group', sort=False).size()\
.apply(lambda x: np.linspace(1,10,x)).explode().reset_index(drop=True)
print(df)
group val id
0 A 2 1.0
1 A 3 2.8
2 A 1 4.6
3 A 4 6.4
4 A 4 8.2
5 A 2 10.0
6 B 4 1.0
7 B 5 3.25
8 B 7 5.5
9 B 4 7.75
10 B 2 10.0
11 C 1 1.0
12 C 3 2.5
13 C 2 4.0
14 C 1 5.5
15 C 5 7.0
16 C 2 8.5
17 C 1 10.0
18 D 1 1.0
19 D 1 4.0
20 D 1 7.0
21 D 1 10.0
CodePudding user response:
Use:
def f(x):
x = pd.Series(np.nan, index=x.index)
x.iat[0] = min_id
x.iat[-1] = max_id
return x.interpolate()
df['ID'] = df.groupby('group')['group'].transform(f)
print (df)
group val ID
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
Or:
f = lambda x: np.linspace(min_id,max_id,len(x))
df['ID'] = df.groupby('group')['group'].transform(f)
print (df)
group val ID
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00