Pandas sort_values seems to sort list but getting similar errors?-CodePudding

Using Python 3.9 with Pycharm on mac. I'm loading a .csv containing slices of months of dates/times and values. All data is in strings. Each slice is internally sorted, but the combined list of slices are not. Instead of "123 456 789" it is "321 654 876", as can be seen here:

                time,value
2019-12-11 10:00:00,156
2019-12-11 09:00:00,156
2020-02-07 20:00:00,149.5
2020-02-07 19:00:00,149.8

To remedy this I first convert time column into datetime using df['time'] = pd.to_datetime(df['time']). I then sort the values using df.sort_values(by="time", inplace=True, ascending=True). The resulting list appears correct (even if I graph it), but I tried to create an error finder that compares each value to the last and flags how many times it is out of order:

error1 = 0
error2 = 0
   for _ in range(len(df)):
        if _ > 0:
            if df['time'][_] > df['time'][_ - 1]:
                error1  = 1

df.sort_values(by="time", inplace=True, ascending=True)

for _ in range(len(df)):
    if _ > 0:
        if df['time'][_] > df['time'][_ - 1]:
            x = df['time'][_]
            y = df['time'][_] - df['time'][_ - 1]
            error2  = 1

print(error1 == error2)

Output: True

The post-sort loop should flag either 0% or 100% of values (depending on ascending/descending), but it still flags the same errors as the pre-sort loop. Oddly, the corresponding values on the post-sort list look like they sorted appropriately.

I tried:

I confirmed that I was not sorting strings and I was including "inplace=True", which are two common sort_value SO questions
I also tried assigning to a different variable: df_sorted = df.sort_values(by="time") However this gave the same result.

CodePudding user response：

@QuangHoang has the correct answer in the comments.

Expanded: Although df.sort_values(by="time", inplace=True, ascending=True) has the effect of re-sorting the values in the column specified, it does not change the index. To do this, I added ignore_index=True and another line df.reset_index(drop=True, inplace=True), making the full code:

df.sort_values(by="time", inplace=True, ascending=True, ignore_index=True)
df.reset_index(drop=True, inplace=True)

As a last check when saving the file, I added index=True which saves the index when exporting as CSV. That way you can directly troubleshoot what is happening to the values and indices with any manipulation.

df.to_csv(f'fname', index=True)