How to convert a non-fixed-width, space-delimited file to a pandas DataFrame

ID                                     0x4607
Delivery_person_ID             INDORES13DEL02
Delivery_person_Age                 37.000000
Delivery_person_Ratings              4.900000
Restaurant_latitude                 22.745049
Restaurant_longitude                75.892471
Delivery_location_latitude          22.765049
Delivery_location_longitude         75.912471
Order_Date                         19-03-2022
Time_Orderd                             11:30
Time_Order_picked                       11:45
Weather conditions                      Sunny
Road_traffic_density                     High
Vehicle_condition                           2
Type_of_order                           Snack
Type_of_vehicle                    motorcycle
multiple_deliveries                  0.000000
Festival                                   No
City                                    Urban
Time_taken (min)                    24.000000
Name: 0, dtype: object

In an online exam, the machine learning training dataset has been split into multiple txt files. Each file contains data in the format shown above. I don't understand how to read this data in Python and convert it to a pandas DataFrame. There are more than 45,000 txt files, each containing one record of the dataset. I need to merge those 45,000 txt files into a single .csv file. Any help will be highly appreciated.

CodePudding user response：

Each of your txt files seems to contain only 1 row (as a Series).

Unfortunately, these rows are not in a machine-friendly format; it looks like each Series was simply printed and the output saved to a file.

Because of this, my solution does not read the original index of each row (the Name value in the last line of each file); the final dataframe gets a fresh index instead.

You'll have to iterate through all your files. For this example, I'm using a short list of file names:

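import pandas as pd

file_names = ['file0.txt', 'file1.txt']

# Split on runs of two or more whitespace characters, so names that contain
# a single space (e.g. "Weather conditions") stay intact.
# skipfooter=1 drops the trailing "Name: 0, dtype: object" line.
rows = [pd.read_csv(file_name, sep=r'\s\s+', header=None, index_col=0, skipfooter=1, engine='python').iloc[:, 0]
        for file_name in file_names]

df = pd.DataFrame(rows).reset_index(drop=True)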
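
To scale this from a hand-written list to all 45,000 files, the file names can be collected with glob and the result written out with to_csv. The following is only a sketch: the helper names read_record and merge_records, the records/ directory, and the merged.csv output name are assumptions for illustration, not part of the original question.

```python
import glob

import pandas as pd


def read_record(path):
    """Read one printed-Series file and return it as a pandas Series."""
    frame = pd.read_csv(
        path,
        sep=r"\s\s+",     # two or more whitespace characters separate name and value
        header=None,
        index_col=0,
        skipfooter=1,     # drop the trailing "Name: ..., dtype: object" line
        engine="python",  # needed for a regex separator and skipfooter
    )
    return frame.iloc[:, 0]


def merge_records(pattern):
    """Read every file matching the glob pattern into one DataFrame, one row per file."""
    paths = sorted(glob.glob(pattern))
    rows = [read_record(path) for path in paths]
    return pd.DataFrame(rows).reset_index(drop=True)
```

Assuming the files live in a hypothetical records/ directory, merge_records("records/*.txt").to_csv("merged.csv", index=False) would write the combined table. All values are read as strings here, so numeric columns may still need converting (for example with pd.to_numeric) afterwards.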

CodePudding user response：

You can also do this with plain Python, with something like:

data = """ID                                     0x4607
Delivery_person_ID             INDORES13DEL02
Delivery_person_Age                 37.000000
Delivery_person_Ratings              4.900000
Restaurant_latitude                 22.745049
Restaurant_longitude                75.892471
Delivery_location_latitude          22.765049
Delivery_location_longitude         75.912471
Order_Date                         19-03-2022
Time_Orderd                             11:30
Time_Order_picked                       11:45
Weather conditions                      Sunny
Road_traffic_density                     High
Vehicle_condition                           2
Type_of_order                           Snack
Type_of_vehicle                    motorcycle
multiple_deliveries                  0.000000
Festival                                   No
City                                    Urban
Time_taken (min)                    24.000000"""

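for line in data.split('\n'):
    # Split on any run of whitespace
    content = line.split()

    # The last token is the value; everything before it is the field name
    name = ' '.join(content[:-1])
    value = content[-1]

    print(name, value)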

Once you have the name and the value for each line, you can add them to a pandas DataFrame.
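
One way to take that last step is to collect each file's name/value pairs in a dict and build the dataframe from a list of such dicts. This is a minimal sketch under stated assumptions: parse_record and records_to_frame are hypothetical helper names, and it assumes that no value contains a space and that the footer line starts with "Name:".

```python
import pandas as pd


def parse_record(text):
    """Turn one printed record into a {name: value} dict of strings."""
    record = {}
    for line in text.splitlines():
        # Skip blank lines and the trailing "Name: 0, dtype: object" footer.
        if not line.strip() or line.startswith("Name:"):
            continue
        # The value is the last whitespace-separated token; the rest is the name.
        name, value = line.rsplit(maxsplit=1)
        record[name.strip()] = value
    return record


def records_to_frame(texts):
    """Build a DataFrame with one row per record text."""
    return pd.DataFrame([parse_record(text) for text in texts])
```

Each text would come from reading one of the txt files, for example with open(path).read(). As with the print loop above, every value stays a string, so numeric columns would still need converting before modelling.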