Weird df.sort_values result - randomly leaves multiple rows empty but missing values are misplaced a

Time:02-08

I've run into a very weird problem when using pandas to sort my data frame using df.sort_values on my date-time column.

In some sense the code works: it does sort the rows by date. However, in doing so it cuts some of the rows in half, leaving one half of the row where it should be and the rest of that row empty. The other half ends up in the next row down, in the wrong columns.

It's easier to understand from the screenshots.

Pre-Sorted Data

Post-Sorted Data

It's difficult to see from the Post-Sorted Data picture, but not all rows are treated this way (only one unaffected row is in the picture, second from the bottom). Some rows turn out just fine and others don't; it seems very random.

The problem code is shown below.

df = pd.read_csv('Pre-Sort.csv', low_memory=False, skiprows=0, lineterminator='\n')

df = df.sort_values(by="created")

df.to_csv('Post-Sort.csv')

I've tried using inplace=True and ignore_index=True as well but both returned the same results.

I'm worried this is a problem with the data as I've used df.sort_values before with virtually the same data and it worked fine.

Originally, with the data that worked, I realized that for some reason two months' worth of data, roughly 15% of the total, had vanished. The culprit was the step where I merged two data frames, both containing all months, with df.append. For some reason, df.append would consistently skip these two months, so in the end I merged the files through the Mac terminal, which kept the missing months.

This is, as far as I know, the only major difference between the two times I tried the code. It could be a red herring, though. I had to redo many of the operations from scratch, so I might have done something differently than before, and that could be the cause.

Also, this changes the file size from 192MB to 210.3MB which probably shouldn't happen with a sort.

I just need to sort the data so I can resample it into daily variables, so if anyone knows a way to resample without needing to sort that would work for me just as well.
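On the daily aggregation itself, sorting may not be strictly required. A minimal sketch, assuming the goal is a per-day sum and using a made-up three-row frame with a hypothetical `value` column (only `created` comes from the post):

```python
import pandas as pd

# Hypothetical miniature of the data: "created" is deliberately out of order.
df = pd.DataFrame(
    {
        "created": ["2021-01-02 09:00", "2021-01-01 12:00", "2021-01-02 18:00"],
        "value": [1, 2, 3],
    }
)

# Parse the timestamps; strings that cannot be parsed become NaT instead of raising.
df["created"] = pd.to_datetime(df["created"], errors="coerce")

# Group on the calendar day of each timestamp. groupby sorts the group keys
# by default, so the input rows do not have to be sorted beforehand.
daily = df.groupby(df["created"].dt.floor("D"))["value"].sum()
print(daily)
```

Unlike resample, this only produces rows for days that actually appear in the data, so days with no records are simply absent from the result.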

CodePudding user response:

I think you're having problems with indexes (both in merging and sorting).

You can try the code below:

df = pd.read_csv('Pre-Sort.csv', low_memory=False, skiprows=0, lineterminator='\n')

# no assignment here: with inplace=True, sort_values returns None,
# so writing df = df.sort_values(..., inplace=True) would set df to None
df.sort_values(by='created', inplace=True, ignore_index=True)

df.to_csv('Post-Sort.csv')
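One thing that may account for part of the file-size growth mentioned in the question: by default to_csv writes the index as an extra, unnamed first column. A minimal sketch, using a made-up two-row frame and in-memory buffers instead of the real files (the `value` column is hypothetical):

```python
import io

import pandas as pd

# Made-up stand-in for the real data.
df = pd.DataFrame({"created": ["2021-01-02", "2021-01-01"], "value": [1, 2]})
df = df.sort_values(by="created", ignore_index=True)

# Default behaviour: the index is written as an extra first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the original columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # header begins with an empty field for the index
print(header_without_index)
```

Whether this explains the whole difference between 192MB and 210.3MB can't be told from the post, but passing index=False keeps the output columns identical to the input.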

CodePudding user response:

Solved mostly.

The problem wasn't the sort function but a pre-existing problem with the dataset. The displacement issue was already there and only became apparent when the data was mixed together. I'm still trying to figure out exactly where in the data handling it went wrong, but I think it's a delimiter issue.

So far it's an equally confusing problem, which I'll likely have to create another post to fix.
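For anyone chasing a similar delimiter problem, one way to narrow it down is to count the fields in each record and flag the ones that don't match the header. This is only a sketch: the sample text and its column names are made up, and on the real file you would pass open('Pre-Sort.csv', newline='') to csv.reader instead of the in-memory buffer.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the real CSV. The last record has an
# unquoted comma inside a text field, so it splits into one field too many.
sample = io.StringIO(
    "created,text,score\n"
    "2021-01-01,hello,1\n"
    '2021-01-02,"quoted, comma",2\n'
    "2021-01-03,stray, comma,3\n"
)

reader = csv.reader(sample)
header = next(reader)
expected = len(header)

field_counts = Counter()  # how many records have each field count
bad_rows = []             # (record number, field count) for mismatching records

# The header is record 1, so the data records start at 2.
for record_number, row in enumerate(reader, start=2):
    field_counts[len(row)] += 1
    if len(row) != expected:
        bad_rows.append((record_number, len(row)))

print(field_counts)
print(bad_rows)
```

Records with too many fields usually point to an unquoted delimiter inside a value, while records with too few can indicate a line break inside a field.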
