I have a DataFrame:
| | name | age |
|---|---|---|
| 0 | Paul | 25 |
| 1 | John | 27 |
| 2 | Bill | 23 |
I know that if I enter:
df[['name']] = df[['age']]
I'll get the following:
| | name | age |
|---|---|---|
| 0 | 25 | 25 |
| 1 | 27 | 27 |
| 2 | 23 | 23 |
I expected the same result from the command:
df.loc[:, ['name']] = df.loc[:, ['age']]
But instead, I get this:
| | name | age |
|---|---|---|
| 0 | NaN | 25 |
| 1 | NaN | 27 |
| 2 | NaN | 23 |
Why NaN? Is this a bug, or is it intended? I can't figure out the reason for this behaviour.
EDIT:
For some weird reason, if I omit the square brackets [] around the column names, I get exactly what I expected. That is, the command:
df.loc[:, 'name'] = df.loc[:, 'age']
gives the right result:
| | name | age |
|---|---|---|
| 0 | 25 | 25 |
| 1 | 27 | 27 |
| 2 | 23 | 23 |
CodePudding user response:
From the docs, Pandas Data Alignment (emphasis mine): pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.
You can find this excerpt under the Basics header, in the box labelled Warning.
They give an example to explain:
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]
In [11]: df[['A', 'B']]
Out[11]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
From the docs (emphasis mine):
This will not modify df because the column alignment is before value assignment.
To Explicitly Avoid Automatic Alignment
Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).
Alignment comes into play when both the LHS and the RHS are DataFrames. To avoid alignment, pass the RHS as a NumPy array instead:
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
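Applied to the frame from the question, the difference can be sketched as follows. This is a minimal illustration rather than anything taken from the docs: the names aligned and positional are hypothetical, and the explicit dtype=object on the name column is my assumption so that the column can hold integers or NaN regardless of the pandas version.

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame; dtype=object is an
# assumption so the text column can also hold integers or NaN.
df = pd.DataFrame({
    "name": pd.Series(["Paul", "John", "Bill"], dtype=object),
    "age": [25, 27, 23],
})

aligned = df.copy()
# RHS is a DataFrame whose only column is "age"; the target column "name"
# finds no matching label, so it is filled with NaN.
aligned.loc[:, ["name"]] = aligned.loc[:, ["age"]]

positional = df.copy()
# RHS is a bare NumPy array: there are no labels, so values go in by position.
positional.loc[:, ["name"]] = positional.loc[:, ["age"]].to_numpy()

print(aligned["name"].isna().all())   # True
print(positional["name"].tolist())    # [25, 27, 23]
```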
You have two cases at hand: .loc assignment with a pd.DataFrame, and .loc assignment with a pd.Series (in your EDIT).
.loc assignment with pd.DataFrame
A pd.DataFrame has 2 axes, index and columns. So, when you do
df.loc[:, ['name']] = df.loc[:, ['age']]
the LHS has only the column name, which doesn't align with the RHS column age, resulting in all NaN after the assignment.
From the docs, Data Alignment (emphasis mine):
Data alignment between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.
You can find this behaviour in most pandas operations, if not all, for example addition, subtraction and multiplication. Non-matching indices and columns are filled with NaN.
Example from Data Alignment and Arithmetic:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])
df + df2
          A         B         C   D
0  0.045691 -0.014138  1.380871 NaN
1 -0.955398 -1.501007  0.037181 NaN
2 -0.662690  1.534833 -0.859691 NaN
3 -2.452949  1.237274 -0.133712 NaN
4  1.414490  1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
To answer your comment
But why do column indexes need to match? I can see why one want row indexes to match, but why column indexes?
Let's take a look at the above example: if the columns were not aligned, how would you add two DataFrames? It makes sense to align them on both columns and indices.
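To make the column alignment easier to see without random numbers, here is a small deterministic sketch. The frames left and right are hypothetical and chosen only so the result is easy to check by eye.

```python
import pandas as pd

# Two hypothetical frames: "right" lacks column "B" and row label 2.
left = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
right = pd.DataFrame({"A": [100, 200]}, index=[0, 1])

# Addition aligns on both axes; the result carries the union of the labels,
# and every position missing from either operand becomes NaN.
total = left + right
print(total)
```

Column A is added where both frames have a row label (rows 0 and 1), while row 2 of A and all of column B come out as NaN because right has nothing to match them with.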
.loc assignment with pd.Series
A pd.Series has only one axis, i.e. the index. That is the reason why it worked when you did
df.loc[:, 'name'] = df.loc[:, 'age']
As a pd.Series has only one axis, pandas only tried to align the index, and it worked. Of course, if the index doesn't align, the result is NaN values.
From the docs, Series Alignment (emphasis mine): The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN.
CodePudding user response:
That's because for the loc assignment all index axes are aligned, including the columns. Since age and name do not match, there is no data to assign, hence the NaNs.
You can make it work by renaming the columns:
df.loc[:, ["name"]] = df.loc[:, ["age"]].rename(columns={"age": "name"})
or by accessing the numpy array:
df.loc[:, ["name"]] = df.loc[:, ["age"]].values
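Note that the two workarounds are not interchangeable when the rows of the RHS are in a different order: renaming keeps the row alignment, while the array is written purely by position. A rough sketch, assuming the question's frame (the dtype=object on name is my assumption) and a hypothetical row-reversed source frame src:

```python
import pandas as pd

df = pd.DataFrame({
    "name": pd.Series(["Paul", "John", "Bill"], dtype=object),
    "age": [25, 27, 23],
})

# Hypothetical RHS with the same rows in reverse order (index 2, 1, 0).
src = df[["age"]].iloc[::-1]

by_label = df.copy()
# Renamed column matches "name"; rows are still aligned on the index labels.
by_label.loc[:, ["name"]] = src.rename(columns={"age": "name"})

by_position = df.copy()
# A NumPy array carries no labels, so the reversed order is written as-is.
by_position.loc[:, ["name"]] = src.values

print(by_label["name"].tolist())     # [25, 27, 23]
print(by_position["name"].tolist())  # [23, 27, 25]
```

So rename is the safer choice when the labels matter, and .values (or .to_numpy()) when you explicitly want positional assignment.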
CodePudding user response:
When you use double brackets [[]] you are assigning a DataFrame. What you want is to assign a (column) Series, and for that you use only single brackets [].
Here is some code:
import pandas as pd
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Initial Dataframe:\n',df)
df[['name']] = df[['age']]
print("\ndf[['name']] = df[['age']]\n",df)
print("df.loc[:, ['age']]:", type(df.loc[:, ['age']]))
print("df.loc[:, ['name']]:", type(df.loc[:, ['name']]))
df.loc[:, ['name']] = df.loc[:, ['age']]
print("\ndf.loc[:, ['name']] = df.loc[:, ['age']]\n",df)
print('=======================')
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Initial Dataframe:\n',df)
print("type(df.loc[:, 'age']):", type(df.loc[:, 'age']))
print("type(df.loc[:, 'name']):", type(df.loc[:, 'name']))
df.loc[:, 'name'] = df.loc[:, 'age']
print("\ndf.loc[:, 'name'] = df.loc[:, 'age']\n",df)
And the output:
Initial Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
df[['name']] = df[['age']]
name age
0 25 25
1 27 27
2 23 23
df.loc[:, ['age']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']] = df.loc[:, ['age']]
name age
0 NaN 25.0
1 NaN 27.0
2 NaN 23.0
=======================
Initial Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
type(df.loc[:, 'age']): <class 'pandas.core.series.Series'>
type(df.loc[:, 'name']): <class 'pandas.core.series.Series'>
df.loc[:, 'name'] = df.loc[:, 'age']
name age
0 25 25
1 27 27
2 23 23
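However, note what happens if you assign the double-bracket selections to different variables, say df1 and df2, and then do df1 = df2: it "works", but only because a plain assignment rebinds the name df1 to the object that df2 refers to. No alignment is involved, and df itself is not modified.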
Here is some more code:
import pandas as pd
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Initial Dataframe:\n',df)
df1 = df.loc[:, ['name']]
df2 = df.loc[:, ['age']]
print("\ndf1 = df.loc[:, ['name']]\n",df1)
print("\ndf2 = df.loc[:, ['age']]\n",df2)
df1=df2
print("\ndf1=df2\ndf1:\n",df1)
And the output:
Initial Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
df1 = df.loc[:, ['name']]
name
0 Paul
1 John
2 Bill
df2 = df.loc[:, ['age']]
age
0 25
1 27
2 23
df1=df2
df1:
age
0 25
1 27
2 23
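As far as I can tell, this last case is ordinary Python name binding rather than anything pandas-specific. A short check, reusing the same variable names, shows that df1 and df2 end up as the same object while the original df is left untouched:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "John", "Bill"], "age": [25, 27, 23]})
df1 = df.loc[:, ["name"]]
df2 = df.loc[:, ["age"]]

# Plain assignment rebinds the name df1; nothing is written into any DataFrame.
df1 = df2

print(df1 is df2)            # True: both names refer to the same object
print(df["name"].tolist())   # ['Paul', 'John', 'Bill'], df is unchanged
```

That is why the column of df1 is called age afterwards: it is simply df2 under a second name, not the name column with new values in it.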