Difference between pandas functions: df.assign() vs df.reset_index()

Let's say I have a DataFrame:

    stuff  temp
id             
1       3  20.0
1       6  20.1
1       7  21.4
2       1  30.2
2       3   0.0
2       2  34.0
3       7   0.0
3       6   0.0
3       2  14.4

I want to drop the index. Which method is better to use?

There is df.reset_index(drop=True), which is what I usually use.
There is also df.assign(Index=range(len(df))).set_index('Index'), which I don't usually use.
Are there any other methods?

I want to find the most efficient way to drop the index of a pd.DataFrame. Can you give me a clear explanation? I'm working on an efficient code-writing project and want to know the best options. Thanks.

CodePudding user response：

Here are a couple of ways to reset a pandas DataFrame index:

import pandas as pd

def dummy_data():
    # Rebuild the question's DataFrame; each row is [stuff, temp, id]
    return pd.DataFrame(
        columns=['stuff', 'temp', 'id'],
        data=[
            [3, 20.0, 1],
            [6, 20.1, 1],
            [7, 21.4, 1],
            [1, 30.2, 2],
            [3, 0.0, 2],
            [2, 34.0, 2],
            [7, 0.0, 3],
            [6, 0.0, 3],
            [2, 14.4, 3],
        ],
    ).set_index('id', drop=True)

df = dummy_data()
method_1 = df.reset_index(drop=True)

df = dummy_data()
# inplace=True returns None, so keep a reference to the modified df itself
df.reset_index(drop=True, inplace=True)
method_2 = df

df = dummy_data()
# A chained assignment would bind method_3 to the range, not the DataFrame
df.index = range(df.shape[0])
method_3 = df

df = dummy_data()
method_4 = df.assign(Index=range(len(df))).set_index('Index')

df = dummy_data()
# set_index(..., inplace=True) returns None; it modifies the copy made by assign
method_5 = df.assign(Index=range(len(df)))
method_5.set_index('Index', inplace=True)

Important: recreate the test DataFrame before measuring each implementation. Some of the solutions use inplace=True and therefore modify the underlying DataFrame, which would otherwise change the input used to measure the next method.
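One way to guarantee a fresh DataFrame for every timed call is sketched below with the standard-library timeit module instead of the %%timeit magic. The helpers make_frame and time_fresh are hypothetical names introduced for this illustration, and the frame holds placeholder values rather than the question's data.

```python
import timeit

import pandas as pd


def make_frame(n_rows=9):
    # Hypothetical helper: a frame with a non-default 'id' index,
    # loosely modelled on the question's data (values are placeholders).
    ids = [i // 3 + 1 for i in range(n_rows)]
    return pd.DataFrame(
        {"stuff": list(range(n_rows)), "temp": [0.0] * n_rows},
        index=pd.Index(ids, name="id"),
    )


def time_fresh(stmt, repeat=200):
    # With number=1, timeit re-runs the setup before every timed call,
    # so each measurement starts from a freshly built DataFrame.
    runs = timeit.repeat(
        stmt=stmt,
        setup="df = make_frame()",
        repeat=repeat,
        number=1,
        globals={"make_frame": make_frame},
    )
    return min(runs)  # seconds taken by the fastest single call


for stmt in (
    "df.reset_index(drop=True)",
    "df.reset_index(drop=True, inplace=True)",
    "df.index = range(df.shape[0])",
):
    print(f"{stmt}: {time_fresh(stmt) * 1e6:.1f} µs")
```

Timing a single call per measurement is noisier than averaging many loops, so the numbers from this sketch are only a rough guide and will differ between machines and pandas versions.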

Comparing Execution Times

Using the %%timeit magic, we can compare the methods to determine their relative performance.

The following table summarizes the timing of each method for resetting the index, using the table from the original question:

Method	Execution time [µs, mean ± std. dev.; 100000 loops each]	Observations
Method 1	42.4 ± 1.11	Uses `df.reset_index(drop=True)`
Method 2	12.7 ± 2.72	Same as method 1, but in place: `df.reset_index(drop=True, inplace=True)`
Method 3	10.6 ± 0.082	Overwrites the index with `df.index = range(df.shape[0])`
Method 4	812 ± 21.5	Uses `assign` followed by `set_index`; the slowest method
Method 5	692 ± 50.2	Same as method 4, but with `set_index(..., inplace=True)`; slightly faster

According to the above results, it seems that the fastest way to reset an index is to overwrite the original index, as shown in the third method (df.index = range(df.shape[0])).
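As a quick sanity check that overwriting the index gives the same result as reset_index(drop=True), the sketch below compares the two on a three-row frame. The frame is a shortened, illustrative version of the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    {"stuff": [3, 6, 7], "temp": [20.0, 20.1, 21.4]},
    index=pd.Index([1, 1, 1], name="id"),
)

# Method 1, used here as the reference result
expected = df.reset_index(drop=True)

# Method 3: overwrite the index on a copy so df itself stays untouched
overwritten = df.copy()
overwritten.index = range(overwritten.shape[0])

# pandas turns the plain range into a RangeIndex on assignment
assert isinstance(overwritten.index, pd.RangeIndex)
pd.testing.assert_frame_equal(overwritten, expected)
```

Both approaches end with the same default RangeIndex, so the choice between them comes down to speed and whether mutating the original DataFrame is acceptable.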

In addition, in these measurements both pandas.DataFrame.reset_index and pandas.DataFrame.set_index ran faster with the parameter inplace=True. That's because, with inplace=True, both methods modify the DataFrame in place instead of creating a new object.
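The difference in behavior between the two call styles can be seen in a minimal sketch like the following, which uses a small illustrative frame rather than the benchmark data:

```python
import pandas as pd

df = pd.DataFrame({"stuff": [3, 6, 7]}, index=pd.Index([1, 1, 2], name="id"))

# Default call: a new DataFrame comes back and df keeps its old index
new_df = df.reset_index(drop=True)
assert new_df is not df
assert list(df.index) == [1, 1, 2]
assert list(new_df.index) == [0, 1, 2]

# inplace=True: the call returns None and df itself is modified
result = df.reset_index(drop=True, inplace=True)
assert result is None
assert list(df.index) == [0, 1, 2]
```

Because the in-place call returns None, its result should not be assigned to a variable that is later used as a DataFrame.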

Increasing the number of rows does not affect all methods equally. When running the performance test on an input dataset of 9,000,000 rows, Method 1 slows down markedly while Methods 2 and 3 barely change, and Method 3 remains the fastest of the tested implementations:

Method	Execution time [µs, mean ± std. dev.; 100000 loops each]	Observations
Method 1	829 ± 2430	Uses `df.reset_index(drop=True)`
Method 2	12.2 ± 0.095	Same as method 1, but in place: `df.reset_index(drop=True, inplace=True)`
Method 3	10.5 ± 0.095	Overwrites the index with `df.index = range(df.shape[0])`
Method 4	832 ± 67	Uses `assign` followed by `set_index`; the slowest method
Method 5	747 ± 25.2	Same as method 4, but with `set_index(..., inplace=True)`; slightly faster