How to sort a lot of csv files to read them in a specific order?

Time:12-08

Hello, I have a lot of csv files that share the same filename except for a number at the end. For example, with 4 csv files, the first file has no extra number, the second has (0) at the end of the filename, the third has (1), and so on.

I am reading the files with pandas in a for loop, because there are a lot of files in the folder, and I am using sorted to order them. The base filename sorts fine and the first file is in the right place, but the file ending in (0) is the problem: it is put last. I want to fix this because the individual files together hold the data of one big file, and I am trying to concatenate them automatically. Everything works except the sort order, so the right files are concatenated (which is what I want) but in the wrong order.

How can I rectify this? BTW, after reading I put the filenames in a list, and it sorts in the wrong order, like this: ['filename','filename1','filename2','filename0']. But I want this order: ['Filename','Filename0','Filename1','Filename2'].

I know the filenames in the list are strings. I have tried converting them to int and float, but without success: I get this error (ValueError: invalid literal for int() with base 10:)

Any help would be greatly appreciated. I cannot upload the code because it has a lot of functions and is absolutely massive; finding the relevant bits would take me a very long time. Sorry about that.

CodePudding user response:

Use rsplit together with sorted, passing a custom function as the key: the function inspects each filename and returns the value used for the sort comparison.

You can try it like this:

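def function_work(x):
    stem = x.rsplit('.', 1)[0]        # drop the extension: 'Filename5.csv' -> 'Filename5'
    base = stem.rstrip('0123456789')  # strip the trailing number: 'Filename5' -> 'Filename'
    digits = stem[len(base):]         # the trailing number as text, '' if there is none
    # Sort by base name, then numerically; a file without a number comes first.
    return (base, int(digits) if digits else -1)

csvFiles = ['Filename5.csv', 'Filename0.csv', 'Filename1.csv', 'Filename.csv', 'Filename2.csv']
print(sorted(csvFiles, key=function_work))
# output: ['Filename.csv', 'Filename0.csv', 'Filename1.csv', 'Filename2.csv', 'Filename5.csv']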
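
To connect this to the concatenation step in the question, here is a rough sketch that sorts the CSV paths in a folder with a similar key and then joins their rows in that order. It uses the standard csv module rather than pandas so it runs on its own. The names numeric_suffix_key and concat_rows, the demo filenames, and the file contents are all made up for illustration, and it assumes every file starts with the same header row.

```python
import csv
import tempfile
from pathlib import Path

def numeric_suffix_key(path):
    """Hypothetical key: (base name, trailing number), with 'no number' first."""
    stem = path.stem                      # 'Filename10.csv' -> 'Filename10'
    base = stem.rstrip('0123456789')      # 'Filename10' -> 'Filename'
    digits = stem[len(base):]             # '10', or '' when there is no number
    return (base, int(digits) if digits else -1)

def concat_rows(folder):
    """Read every CSV in `folder` in key order, keeping only the first header.

    Assumption: each file begins with the same header row.
    """
    rows = []
    paths = sorted(Path(folder).glob('*.csv'), key=numeric_suffix_key)
    for i, path in enumerate(paths):
        with path.open(newline='') as f:
            file_rows = list(csv.reader(f))
        rows.extend(file_rows if i == 0 else file_rows[1:])
    return rows

# Demo on a throwaway folder with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    for name, value in [('Filename1.csv', 'c'), ('Filename.csv', 'a'),
                        ('Filename10.csv', 'e'), ('Filename0.csv', 'b'),
                        ('Filename2.csv', 'd')]:
        Path(tmp, name).write_text(f'col\n{value}\n')
    combined = concat_rows(tmp)

print(combined)
# [['col'], ['a'], ['b'], ['c'], ['d'], ['e']]
```

With pandas, the same sorted list of paths could presumably be passed through read_csv and then concat instead of the csv reader used here.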

CodePudding user response:

The sorted function takes an additional keyword argument called key that tells it how to sort the items in the iterable. This argument is a function that takes each entry from the input iterable and returns a "rank" or "sort order" for it.
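
As a small illustration of what key does (the names below are made up and not taken from the question), compare a plain sort with one that ranks each string by the number it ends with:

```python
names = ['part10', 'part2', 'part1']

# Without a key, strings compare character by character, so '10' lands before '2'.
plain = sorted(names)

# With a key, sorted() calls the function once per item and compares the results.
by_number = sorted(names, key=lambda s: int(s[len('part'):]))

print(plain)      # ['part1', 'part10', 'part2']
print(by_number)  # ['part1', 'part2', 'part10']
```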

In your case, you'll need to define a key function that puts the "no suffix" file before "0":

lst = ['abc.csv', 'abc (0).csv', 'abc (1).csv']
filenames_split_lst = [_.rsplit('.', 1) for _ in lst]
# [['abc', 'csv'], ['abc (0)', 'csv'], ['abc (1)', 'csv']]
base_filenames = [_ for _, csv in filenames_split_lst]
# ['abc', 'abc (0)', 'abc (1)']

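def sorting_function(base_filename):
    parts = base_filename.split()
    if len(parts) == 1:
        # No "(n)" suffix: rank this file first.
        return 0
    elif len(parts) == 2:
        # Strip the parentheses from "(n)" and add 1 so "(0)" ranks after the bare name.
        number_suffix = parts[1][1:-1]
        return int(number_suffix) + 1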

sorted(base_filenames, key=sorting_function)
# ['abc', 'abc (0)', 'abc (1)']
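
If the real names are less regular (for example no space before the parentheses, or multi-digit numbers), a regular expression may be more forgiving than split. This is only a sketch under the assumption that the number, if present, sits at the very end of the name, optionally wrapped in parentheses; suffix_key is a made-up name.

```python
import re

def suffix_key(filename):
    """Hypothetical key: (base name, number), where a missing number sorts first."""
    stem = filename.rsplit('.', 1)[0]               # drop the extension
    match = re.search(r'\s*\(?(\d+)\)?$', stem)     # trailing "n" or "(n)"
    if match is None:
        return (stem, -1)                           # no number: goes before 0
    return (stem[:match.start()], int(match.group(1)))

files = ['abc (10).csv', 'abc (1).csv', 'abc.csv', 'abc (2).csv', 'abc (0).csv']
ordered = sorted(files, key=suffix_key)
print(ordered)
# ['abc.csv', 'abc (0).csv', 'abc (1).csv', 'abc (2).csv', 'abc (10).csv']
```

Because the key is a tuple, files with different base names are grouped by name first and ordered by number within each group.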