So let's say I have a DataFrame:
    stuff  temp
id
1       3  20.0
1       6  20.1
1       7  21.4
2       1  30.2
2       3   0.0
2       2  34.0
3       7   0.0
3       6   0.0
3       2  14.4
And I want to drop the index; what method is better to use?
There is df.reset_index(drop=True), which is what I usually use. But there is also df.assign(Index=range(len(df))).set_index('Index'), which I don't usually use. And are there any other methods?
Well, I want to find the most efficient/best way to drop the index of a pd.DataFrame. Can you give me a clear explanation? I'm doing an efficient code-writing project and I want to know the best options. Thanks.
CodePudding user response:
Here are a couple of ways to reset the index of a pandas DataFrame:
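import pandas as pd

# Factory that rebuilds the question's DataFrame, so every method starts from a fresh copy.
dummy_data = lambda: pd.DataFrame(
    columns=['stuff', 'temp', 'id'],
    data=[
        [3, 20.0, 1],
        [6, 20.1, 1],
        [7, 21.4, 1],
        [1, 30.2, 2],
        [3, 0.0, 2],
        [2, 34.0, 2],
        [7, 0.0, 3],
        [6, 0.0, 3],
        [2, 14.4, 3],
    ]
).set_index('id', drop=True)

# Method 1: returns a new DataFrame and leaves df untouched.
df = dummy_data()
method_1 = df.reset_index(drop=True)

# Method 2: inplace=True modifies df and returns None, so keep df itself.
df = dummy_data()
df.reset_index(drop=True, inplace=True)
method_2 = df

# Method 3: overwrite the index directly.
df = dummy_data()
df.index = range(df.shape[0])
method_3 = df

# Method 4: add a counter column, then promote it to the index.
df = dummy_data()
method_4 = df.assign(Index=range(len(df))).set_index('Index')

# Method 5: same as method 4, but set_index runs in place on the copy returned by assign.
df = dummy_data()
method_5 = df.assign(Index=range(len(df)))
method_5.set_index('Index', inplace=True)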
Important: when comparing the above implementations, recreate the test DataFrame before measuring each one. Calls that use inplace=True modify the DataFrame they are called on, which would otherwise change the input used to measure the next method.
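As a minimal sketch of why the inplace variants need care, the snippet below shows that a call with inplace=True returns None, and that a chained assignment binds the right-hand side (the range) rather than the DataFrame. The three-row frame and the variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame(
    {"stuff": [3, 6, 7], "temp": [20.0, 20.1, 21.4]},
    index=pd.Index([1, 1, 1], name="id"),
)

# inplace=True mutates df and returns None, so the result must not be assigned.
returned = df.reset_index(drop=True, inplace=True)
print(returned)        # None
print(list(df.index))  # [0, 1, 2]

# A chained assignment binds the range object to both targets,
# so alias is a range, not a DataFrame.
df2 = pd.DataFrame({"stuff": [1, 3, 2]}, index=[2, 2, 2])
alias = df2.index = range(len(df2))
print(type(alias).__name__)  # range
```

Keeping a reference to the DataFrame itself (rather than to the return value) avoids both pitfalls.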
Comparing Execution Times
Using the %%timeit magic, we can compare the methods to determine their overall performance.
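Outside a notebook, a comparable measurement can be sketched with the standard-library timeit module. This is not the exact setup behind the numbers reported here: the helper names (make_frame, time_method) and the repeat count are assumptions, and the frame is rebuilt before every single call so that in-place methods never see an already-reset index.

```python
import statistics
import timeit

import pandas as pd


def make_frame():
    """Rebuild the question's DataFrame so every measurement starts from the same input."""
    return pd.DataFrame(
        {
            "stuff": [3, 6, 7, 1, 3, 2, 7, 6, 2],
            "temp": [20.0, 20.1, 21.4, 30.2, 0.0, 34.0, 0.0, 0.0, 14.4],
        },
        index=pd.Index([1, 1, 1, 2, 2, 2, 3, 3, 3], name="id"),
    )


def reset_copy(df):
    return df.reset_index(drop=True)


def reset_inplace(df):
    df.reset_index(drop=True, inplace=True)
    return df


def overwrite_index(df):
    df.index = range(df.shape[0])
    return df


def time_method(func, repeat=200):
    """Mean time per call in microseconds; setup runs before each call (number=1)."""
    runs = timeit.repeat(
        "func(df)",
        setup="df = make_frame()",
        repeat=repeat,
        number=1,
        globals={"func": func, "make_frame": make_frame},
    )
    return statistics.mean(runs) * 1e6


timings = {f.__name__: time_method(f) for f in (reset_copy, reset_inplace, overwrite_index)}
for name, micros in timings.items():
    print(f"{name}: {micros:.1f} µs")
```

Absolute numbers will differ between machines and pandas versions, so only the relative ordering is worth comparing.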
The following table summarizes the timing of each method for resetting the index, using the table presented in the original question:
Method | Execution Time [µs ± µs /100000 loops each] | Observations |
---|---|---|
Method 1 | 42.4 ± 1.11 | Uses df.reset_index(drop=True) |
Method 2 | 12.7 ± 2.72 | Same as method 1, but inplace: df.reset_index(drop=True, inplace=True) |
Method 3 | 10.6 ± 0.082 | Overwrites the index with: df.index = range(df.shape[0]) |
Method 4 | 812 ± 21.5 | assign followed by set_index appears to be the slowest method |
Method 5 | 692 ± 50.2 | assign improves a little when set_index makes the change inplace |
According to the above results, the fastest way to reset an index appears to be overwriting the original index, as shown in the third method (df.index = range(df.shape[0])).
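A quick way to check that overwriting the index gives the same result as reset_index(drop=True) is to compare the two frames directly. The sketch below uses a hypothetical three-row frame; pd.RangeIndex is shown as an explicit alternative to passing a plain range, which is an addition not taken from the methods compared here.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame(
    {"stuff": [3, 6, 7], "temp": [20.0, 20.1, 21.4]},
    index=pd.Index([1, 1, 1], name="id"),
)

# Reference result: a new frame with a fresh 0..n-1 index.
expected = df.reset_index(drop=True)

# Method 3 applied to a copy, so the original df stays available for comparison.
overwritten = df.copy()
overwritten.index = range(overwritten.shape[0])

# Explicit alternative: build the RangeIndex directly.
explicit = df.copy()
explicit.index = pd.RangeIndex(len(explicit))

# Both raise AssertionError if the frames differ in values, dtypes, or index.
pd.testing.assert_frame_equal(overwritten, expected)
pd.testing.assert_frame_equal(explicit, expected)
```

Note that assigning to df.index mutates the DataFrame, so it only suits code that does not need the old index afterwards.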
In addition, both pandas.DataFrame.reset_index and pandas.DataFrame.set_index run faster when the inplace parameter is set to True. That's because, when set to True, both methods modify the DataFrame in place (instead of creating a new object).
Increasing the number of rows does not change the ranking: when running the performance test on an input dataset of 9,000,000 rows, Method 3 remains the fastest of the tested implementations:
Method | Execution Time [µs ± µs /100000 loops each] | Observations |
---|---|---|
Method 1 | 829 ± 2430 | Uses df.reset_index(drop=True) |
Method 2 | 12.2 ± 0.095 | Same as method 1, but inplace: df.reset_index(drop=True, inplace=True) |
Method 3 | 10.5 ± 0.095 | Overwrites the index with: df.index = range(df.shape[0]) |
Method 4 | 832 ± 67 | assign followed by set_index appears to be the slowest method |
Method 5 | 747 ± 25.2 | assign improves a little when set_index makes the change inplace |
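To repeat the comparison at other sizes, something like the following sketch could be used. The frame builder (make_large_frame), its synthetic column values, and the chosen row counts are all assumptions; the 9,000,000-row input used for the table above is not reproduced here.

```python
import timeit

import numpy as np
import pandas as pd


def make_large_frame(n_rows):
    """Hypothetical stand-in dataset: n_rows rows with a repeated, non-default 'id' index."""
    return pd.DataFrame(
        {"stuff": np.arange(n_rows) % 8, "temp": np.zeros(n_rows)},
        index=pd.Index(np.arange(n_rows) // 3 + 1, name="id"),
    )


def reset_copy(df):
    return df.reset_index(drop=True)


def overwrite_index(df):
    df.index = range(df.shape[0])
    return df


def time_once(func, n_rows):
    """Time a single call on a freshly built frame, in microseconds."""
    df = make_large_frame(n_rows)
    start = timeit.default_timer()
    func(df)
    return (timeit.default_timer() - start) * 1e6


for n_rows in (9, 9_000, 900_000):
    copy_us = time_once(reset_copy, n_rows)
    overwrite_us = time_once(overwrite_index, n_rows)
    print(f"{n_rows:>9} rows: reset_index {copy_us:.1f} µs, overwrite {overwrite_us:.1f} µs")
```

A single call per size is noisy, so several repetitions per size would be needed before drawing conclusions.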