import pandas as pd
df1 = pd.DataFrame({
"value": [1, 1, 1, 2, 2, 2]})
print(df1)
print("-------------------------")
print(df1.reset_index())
print("-------------------------")
print(df1.reset_index().index)
print("-------------------------")
print(df1.reset_index()["index"])
produces the output
value
0 1
1 1
2 1
3 2
4 2
5 2
-------------------------
index value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
-------------------------
RangeIndex(start=0, stop=6, step=1)
-------------------------
0 0
1 1
2 2
3 3
4 4
5 5
Name: index, dtype: int64
I am wondering why print(df1.reset_index().index)
and
print(df1.reset_index()["index"])
prints different things in this case? The latter prints the "index" column, while the former prints the indices.
If we want to access the reset indices (the column), then it seems we have to use brackets?
CodePudding user response:
The .index
attribute in a pandas DataFrame will always point to the Index (row label) of the DataFrame not a column named "index".
If we want to access the reset indices (the column), then it seems we have to use brackets?
Yes, or you can assign a name when reseting the index for example:
df1.reset_index(names='the_index').the_index
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# Name: the_index, dtype: int64
CodePudding user response:
Several things happened. First, when you don't specify and index, pandas
uses a RangeIndex
object as a virtual index of the dataframe. The dataframe is a collection of numpy arrays which are naturally indexed from 0, 1, 2, and etc. Since RangeIndex
is just 0, 1, etc... it doesn't actually create its values in memory. Had you printed the index of the original df1
, it would be a RangeIndex
, just like df1.reset_index().index
.
reset_index
has an optional drop
parameter. By default, pandas will take the existing index and turn it into a column of the dataframe. This was a RangeIndex
object but it had to be expanded into a realized column to fit with the other columns in the df. Had you included drop=True
, there would be no "index" column.
When you reset the index, dataframes always have to have some index and the default is that virtual RangeIndex
you see.
DataFrames have a shortcut where some columns can be addressed by attribute name rather than item (the square brackets). But, if the column name doesn't meet python's attribute naming rules or if it clashes with an existing attribute, you can't reference it that way. .index
is the dataframe index so if you happen to also have a column "index", you need to access it via the square bracket item protocol.
One could argue that pandas should never have allowed the attribute access path because it can't be used consistently. I wouldn't argue that (except I totally would).
CodePudding user response:
It does this because you are printing different things:
print(df1.reset_index().index)
is the same as:
df = df1.reset_index()
print(df)
This firstly adds an Id index to the dataframe then prints the actual index of the df.
print(df1.reset_index()["index"])
is the equivalent of
df = df1.reset_index()
print(df["index"])
It firstly adds an Id index to the dataframe but keeps both "index" and "values" columns. It then prints the Column named "Index" (which is NOT the index of the df)
If you want to make the "index" column the index, you must use:
df = df1.set_index("index")