How to convert 50,000 txt files into a CSV


I have about 50,000 text files. I am trying to combine them into a single CSV file, but it is taking a very long time. I started the script at night before going to sleep; by morning it had processed only about 4,500 files and was still running. Is there a faster way to convert the text files into a CSV?

Here is my code:

import pandas as pd
import os
import glob
from tqdm import tqdm

# create empty dataframe
csvout = pd.DataFrame(columns =["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"])

# get list of files
file_list = glob.glob(os.path.join(os.getcwd(), "train", "*.txt"))

for filename in tqdm(file_list):
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split key from value on the three-space padding
        for line in tqdm(datafile):
            # Note: partition returns 3 parts: "key", "   ", "value";
            # the slice [::2] steps by 2, so only the 1st and 3rd are kept
            name, var = line.partition("   ")[::2]
            mydict[name.strip()] = var.strip()
        # put dictionary in dataframe
        csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)

Here is an example text file:

ID                                     0xb379
Delivery_person_ID             BANGRES18DEL02
Delivery_person_Age                 34.000000
Delivery_person_Ratings              4.500000
Restaurant_latitude                 12.913041
Restaurant_longitude                77.683237
Delivery_location_latitude          13.043041
Delivery_location_longitude         77.813237
Order_Date                         25-03-2022
Time_Orderd                             19:45
Time_Order_picked                       19:50
Weather conditions                     Stormy
Road_traffic_density                      Jam
Vehicle_condition                           2
Type_of_order                           Snack
Type_of_vehicle                       scooter
multiple_deliveries                  1.000000
Festival                                   No
City                            Metropolitian
Time_taken (min)                    33.000000

CodePudding user response:

Try it like this.

import glob
with open('my_file.csv', 'a') as csv_file:
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            txt = txt_file.read() + '\n'
            csv_file.write(txt)
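
Note that this just concatenates the raw files, so the output is not actually CSV. If you want proper rows, here is a minimal sketch using only the standard library's csv module, assuming the fixed "key   value" layout and the column list from the question:

import csv
import glob
import os

# column order taken from the question
columns = ["ID", "Delivery_person_ID", "Delivery_person_Age",
           "Delivery_person_Ratings", "Restaurant_latitude",
           "Restaurant_longitude", "Delivery_location_latitude",
           "Delivery_location_longitude", "Order_Date", "Time_Orderd",
           "Time_Order_picked", "Weather conditions", "Road_traffic_density",
           "Vehicle_condition", "Type_of_order", "Type_of_vehicle",
           "multiple_deliveries", "Festival", "City", "Time_taken (min)"]

with open("train.csv", "w", newline="") as out:
    writer = csv.writer(out, delimiter=";")
    writer.writerow(columns)
    for path in glob.glob(os.path.join("train", "*.txt")):
        record = {}
        with open(path) as f:
            for line in f:
                # split each "key   value" line on the first triple space
                key, _, value = line.partition("   ")
                record[key.strip()] = value.strip()
        # missing keys become empty fields instead of shifting columns
        writer.writerow([record.get(col, "") for col in columns])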

CodePudding user response:

CSV is a very simple data format that you don't need any sophisticated tools to handle. It's just text and separators.

In your (hopefully simple) case there is no need for pandas or dictionaries.

The exception is if your data files are corrupt, missing some columns or containing additional ones to skip. But even then you can handle such issues better in your own code, where you have more control, and still get results within seconds; see the sketch right below.
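
For example, a minimal validation sketch (a hypothetical parse_file helper, not part of the code further down) that reports unexpected or missing keys instead of silently writing a misaligned row:

# minimal validation sketch (hypothetical helper): raises on files
# with unexpected or missing keys instead of misaligning the row
def parse_file(path, columns):
    record = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            key, _, value = line.partition("   ")
            key = key.strip()
            if key not in columns:
                raise ValueError(f"{path}: unexpected key {key!r}")
            record[key] = value.strip()
    missing = [c for c in columns if c not in record]
    if missing:
        raise ValueError(f"{path}: missing columns {missing}")
    return [record[c] for c in columns]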

Assuming your data files are not corrupt, i.e. all columns are present in the right order with none missing or added (so you can rely on their formatting), just try this code:

from time import perf_counter as T
sT = T()
filesProcessed = 0
columns =["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"]
import glob, os
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))
csv_lines = []
csv_line_counter = 0
for filename in file_list:
    filesProcessed += 1
    with open(filename) as datafile:
        csv_line = ""
        for line in datafile.read().splitlines():
            # take everything after the first "  " (double space) as the value
            var = line.partition("  ")[-1]
            csv_line += var.strip() + ';'
        csv_lines.append(str(csv_line_counter) + ';' + csv_line[:-1])
        csv_line_counter += 1
with open("train.csv", "w") as csvfile:
    csvfile.write(';' + ';'.join(columns) + '\n')
    csvfile.write('\n'.join(csv_lines))
eT = T()
print(f'> {filesProcessed=}, {(eT-sT)=:8.6f}')

I expect you will get the result much faster than you anticipated (in seconds, not minutes or hours).

On my computer, extrapolating from the processing time for 100 files, the time required for 50,000 files should be about 3 seconds.
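
To sanity-check the output, you can read it back (assuming pandas is installed; the leading unnamed column is the row counter written above):

import pandas as pd

df = pd.read_csv("train.csv", sep=";", index_col=0)
print(df.shape)   # expected: (number_of_files, 20)
print(df.head())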
