Home > other >  Create a new column and assign values to the first row by each group in Python Pandas
Create a new column and assign values to the first row by each group in Python Pandas

Time:12-10

I have an existing pandas dataframe below:

   id  time  c  d
0   1     1  2  3
1   1     3  1  6
2   2     2  3  2
3   2     3  8  6

I also have values stored in a list such as:

list = [0.4, 0.6]

I want to create a new column in the existing dateframe and assign each list element in the first row for each group (id) such as:

   id  time  c  d  new_col
0   1     1  2  3  0.4
1   1     3  1  6
2   2     2  3  2  0.6
3   2     3  8  6

CodePudding user response:

Does this help solve your problem? (I can't see any benefit to only putting that value in the first occurrence of the group and not the other values)

mapping = {1: 0.4, 2: 0.6}

df["new_col"] = df['id'].map(mapping)

Result:

enter image description here

Note: If all of your ids are sequential integers starting at 1 and all of the values in your list are also in that correct order you could convert it to a mapping dict using:

mapping = {i 1: value for i, value in enumerate(your_list)}

CodePudding user response:

Group the dataframe by id and use cumcount to create a sequential counter, then use boolean indexing with loc to assign the list values where the value of the counter is 0

lst = [0.4, 0.6]
df.loc[df.groupby('id').cumcount().eq(0), 'new_col'] = lst

   id  time  c  d  new_col
0   1     1  2  3      0.4
1   1     3  1  6      NaN
2   2     2  3  2      0.6
3   2     3  8  6      NaN

CodePudding user response:

One way using pandas.DataFrame.groupby.ngroup with numpy.take:

l = [0.4, 0.6]
gid = df.groupby("id").ngroup()
df["new_col"] = np.where(gid.duplicated(), "",  np.take(l, gid))

Output:

   id  time  c  d new_col
0   1     1  2  3     0.4
1   1     3  1  6        
2   2     2  3  2     0.6
3   2     3  8  6        

Explain:

The difficulty here is to apply only to the first row of each group.

Applying duplicated to the taken value is dangerous since it will remove values in another group should the values happened to duplicate (e.g. if l == [0.4, 0.6, 0.4]).

So duplicated is applied to gid to assert we extract the first row of each group.

  • df.groupby("id").ngroup(): enumerates the group to create indices. Note that this is invulnerable to the type of df["id"], such as ["a", "a", "b", "b"].

    In case df["id"] is always numeric and starts from 1, you can replace this with

    gid = df["id"] - 1
    
  • np.take(l, gid): With the generated indices, take the list elements.

  • np.where(gid.duplicated, ...): To ensure the taken values from l is only applied to the first rows of each group (a.k.a. non-duplicates), and the rest gets filled with "".

CodePudding user response:

Use Series.map with dictionary created by enumerate and set values only first per groups in Series.mask with Series.duplicated:

L=[0.4, 0.6]

df["new_col"] = df['id'].sub(1).map(dict(enumerate(L))).mask(df['id'].duplicated())

print (df)
   id  time  c  d  new_col
0   1     1  2  3      0.4
1   1     3  1  6      NaN
2   2     2  3  2      0.6
3   2     3  8  6      NaN

L=[0.4, 0.6]

df["new_col"] = df['id'].sub(1).map(dict(enumerate(L))).mask(df['id'].duplicated(),'')

print (df)
   id  time  c  d new_col
0   1     1  2  3     0.4
1   1     3  1  6        
2   2     2  3  2     0.6
3   2     3  8  6        

If possible any groups in id, e.g. 10, 20 use GroupBy.ngroup:

L=[0.4, 0.6]

df["new_col"] = (df.groupby('id').ngroup().map(dict(enumerate(L)))
                   .mask(df['id'].duplicated(),''))

print (df)
   id  time  c  d new_col
0  10     1  2  3     0.4
1  10     3  1  6        
2  20     2  3  2     0.6
3  20     3  8  6        
  • Related