I have an existing pandas dataframe below:
id time c d
0 1 1 2 3
1 1 3 1 6
2 2 2 3 2
3 2 3 8 6
I also have values stored in a list such as:
list = [0.4, 0.6]
I want to create a new column in the existing dateframe and assign each list element in the first row for each group (id) such as:
id time c d new_col
0 1 1 2 3 0.4
1 1 3 1 6
2 2 2 3 2 0.6
3 2 3 8 6
CodePudding user response:
Does this help solve your problem? (I can't see any benefit to only putting that value in the first occurrence of the group and not the other values)
mapping = {1: 0.4, 2: 0.6}
df["new_col"] = df['id'].map(mapping)
Result:
Note: If all of your ids are sequential integers starting at 1 and all of the values in your list are also in that correct order you could convert it to a mapping dict using:
mapping = {i 1: value for i, value in enumerate(your_list)}
CodePudding user response:
Group the dataframe by id
and use cumcount
to create a sequential counter, then use boolean indexing with loc
to assign the list values where the value of the counter is 0
lst = [0.4, 0.6]
df.loc[df.groupby('id').cumcount().eq(0), 'new_col'] = lst
id time c d new_col
0 1 1 2 3 0.4
1 1 3 1 6 NaN
2 2 2 3 2 0.6
3 2 3 8 6 NaN
CodePudding user response:
One way using pandas.DataFrame.groupby.ngroup
with numpy.take
:
l = [0.4, 0.6]
gid = df.groupby("id").ngroup()
df["new_col"] = np.where(gid.duplicated(), "", np.take(l, gid))
Output:
id time c d new_col
0 1 1 2 3 0.4
1 1 3 1 6
2 2 2 3 2 0.6
3 2 3 8 6
Explain:
The difficulty here is to apply only to the first row of each group.
Applying duplicated
to the taken value is dangerous since it will remove values in another group should the values happened to duplicate (e.g. if l == [0.4, 0.6, 0.4]
).
So duplicated
is applied to gid
to assert we extract the first row of each group.
df.groupby("id").ngroup()
: enumerates the group to create indices. Note that this is invulnerable to the type ofdf["id"]
, such as["a", "a", "b", "b"]
.In case
df["id"]
is always numeric and starts from1
, you can replace this withgid = df["id"] - 1
np.take(l, gid)
: With the generated indices, take the list elements.np.where(gid.duplicated, ...)
: To ensure the taken values froml
is only applied to the first rows of each group (a.k.a. non-duplicates), and the rest gets filled with""
.
CodePudding user response:
Use Series.map
with dictionary created by enumerate
and set values only first per groups in Series.mask
with Series.duplicated
:
L=[0.4, 0.6]
df["new_col"] = df['id'].sub(1).map(dict(enumerate(L))).mask(df['id'].duplicated())
print (df)
id time c d new_col
0 1 1 2 3 0.4
1 1 3 1 6 NaN
2 2 2 3 2 0.6
3 2 3 8 6 NaN
L=[0.4, 0.6]
df["new_col"] = df['id'].sub(1).map(dict(enumerate(L))).mask(df['id'].duplicated(),'')
print (df)
id time c d new_col
0 1 1 2 3 0.4
1 1 3 1 6
2 2 2 3 2 0.6
3 2 3 8 6
If possible any groups in id
, e.g. 10, 20
use GroupBy.ngroup
:
L=[0.4, 0.6]
df["new_col"] = (df.groupby('id').ngroup().map(dict(enumerate(L)))
.mask(df['id'].duplicated(),''))
print (df)
id time c d new_col
0 10 1 2 3 0.4
1 10 3 1 6
2 20 2 3 2 0.6
3 20 3 8 6