I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
import pandas as pd
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
                  index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
index | a | b |
---|---|---|
2 | alpha | gamma |
0 | beta | alpha |
1 | gamma | beta |
I think this is happening because the pd.Series has an index [0, 1, 2], which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order", ignoring the index:
index | a | b |
---|---|---|
2 | alpha | alpha |
0 | beta | beta |
1 | gamma | gamma |
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
index | num | num_times_two |
---|---|---|
2 | 1 | 6 |
0 | 2 | 2 |
1 | 3 | 4 |
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
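To check the guess above that the Series index is being matched against the DataFrame index, here is a minimal sketch. It reuses the df from the first example; the column 'c' and the labels [0, 5] are made up for illustration. If assignment matches on labels rather than position, labels missing from the Series should show up as NaN:

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])

# A Series built from a plain list gets the default index 0, 1, 2.
ser = pd.Series(['alpha', 'beta', 'gamma'])

# Assignment matches on index labels, not on position.
df['b'] = ser
print(df['b'].tolist())  # ['gamma', 'alpha', 'beta']

# Hypothetical column 'c': only label 0 exists in both indexes,
# so the rows labelled 2 and 1 are filled with NaN.
df['c'] = pd.Series(['x', 'y'], index=[0, 5])
print(df['c'].isna().tolist())  # [True, False, True]
```

The NaN rows suggest this is label-based alignment doing what it is designed to do rather than values being shuffled at random.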
CodePudding user response:
If you don't want the dtype conversions between pandas and numpy (for example, with datetimes), you can set the index of the Series to match the index of the DataFrame before assigning it to a column:
- either with .set_axis(). The original Series keeps its own index, since by default this operation is not in place:
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
- or you can change the index of the original Series:
ser.index = df.index  # older pandas also allowed ser.set_axis(df.index, inplace=True); the inplace argument was removed in pandas 2.0
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
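The options in this answer can be compared side by side. This is a minimal sketch that reuses the question's data; the column names b_set_axis, b_numpy and b_list are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])
ser = pd.Series(['alpha', 'beta', 'gamma'])

# Option 1: relabel a copy of the Series with the DataFrame's index.
df['b_set_axis'] = ser.set_axis(df.index)

# Option 2: drop the index entirely by converting to a numpy array.
df['b_numpy'] = ser.to_numpy()

# Option 3: any other array-like without an index, such as a list.
df['b_list'] = ser.tolist()

# All three columns follow row position, so each one matches column 'a'.
for col in ['b_set_axis', 'b_numpy', 'b_list']:
    print(col, (df[col] == df['a']).all())

# set_axis returned a new Series, so the original index is untouched.
print(ser.index.tolist())  # [0, 1, 2]
```

The set_axis variant stays inside pandas the whole time, which is why it is the one to reach for when the dtype round trip through numpy matters.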
CodePudding user response:
I don't know if it is on purpose, but new column assignment is based on the index. Do you need to keep the old index?
If the answer is no, you can simply reset the index before adding the new column:
df = df.reset_index(drop=True)
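One detail worth spelling out: reset_index returns a new DataFrame rather than modifying df, so the result has to be assigned back for the reset to take effect. A minimal sketch using the question's data, assuming the old labels 2, 0, 1 are not needed afterwards:

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])

# Replace the index with the default 0, 1, 2; drop=True discards the
# old labels instead of keeping them as a new column.
df = df.reset_index(drop=True)

# The Series also has the default index, so labels and positions agree.
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])

print(df.index.tolist())  # [0, 1, 2]
print(df['b'].tolist())   # ['alpha', 'beta', 'gamma']
```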
CodePudding user response:
In your example, I don't see any reason to wrap the result in a new Series, even if something strips the index (like converting to a list). Assigning the list directly keeps the values in row order:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
   num  num_times_two
2    1              2
0    2              4
1    3              6