ID 0x4607
Delivery_person_ID INDORES13DEL02
Delivery_person_Age 37.000000
Delivery_person_Ratings 4.900000
Restaurant_latitude 22.745049
Restaurant_longitude 75.892471
Delivery_location_latitude 22.765049
Delivery_location_longitude 75.912471
Order_Date 19-03-2022
Time_Orderd 11:30
Time_Order_picked 11:45
Weather conditions Sunny
Road_traffic_density High
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle motorcycle
multiple_deliveries 0.000000
Festival No
City Urban
Time_taken (min) 24.000000
Name: 0, dtype: object
In an online exam, the machine learning training dataset has been split into multiple txt files. Each file contains data as shown above. I don't understand how to read this data in Python and convert it to a pandas DataFrame. There are more than 45,000 txt files, each containing one record of the dataset, and I have to merge them into a single .csv file. Any help will be highly appreciated.
CodePudding user response:
Each of your txt files seems to contain only one row (as a Series).
Unfortunately, these rows are not in a machine-friendly format: it looks like each Series was simply printed and saved as-is.
Because of this, my solution does not read the original index of each row (the Name in the last line of each file): the final DataFrame is reindexed.
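If you do need the original index, the Name can be pulled out of the footer line separately. This is only a minimal sketch: the helper name `parse_series_footer` is made up for illustration, and it assumes the footer always looks like `Name: 0, dtype: object`.

```python
import re

def parse_series_footer(footer_line):
    """Return the Series name from a footer such as 'Name: 0, dtype: object'.

    Returns None when the line does not look like a printed-Series footer.
    """
    match = re.match(r"Name:\s*(.+?),\s*dtype:", footer_line.strip())
    return match.group(1) if match else None

name = parse_series_footer("Name: 0, dtype: object")
print(name)
```

The name comes back as a string, so convert it with `int(...)` if the original index was numeric.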
You'll have to iterate through all your files. Just for my example, I'm using a list of the file names:
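import pandas as pd

file_names = ['file0.txt', 'file1.txt']
# Assumes names and values are separated by two or more whitespace characters (as in a printed Series);
# skipfooter=1 drops the "Name: ..., dtype: ..." line.
rows = [pd.read_csv(file_name, sep=r'\s\s+', header=None, index_col=0, skipfooter=1, engine='python').iloc[:, 0]
        for file_name in file_names]
df = pd.DataFrame(rows).reset_index(drop=True)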
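For the full set of 45,000 files you probably don't want to type the file names by hand. Here is a rough sketch of the same approach that collects the files from a folder and writes the merged result to CSV. The function names `read_record` and `merge_records`, and the folder and output names in the usage note, are placeholders of mine, and it still assumes each name is separated from its value by at least two spaces.

```python
from pathlib import Path

import pandas as pd

def read_record(path):
    """Read one printed-Series text file into a pandas Series, skipping the footer line."""
    return pd.read_csv(
        path,
        sep=r"\s\s+",      # two or more whitespace characters between name and value
        header=None,
        index_col=0,
        skipfooter=1,       # drop the "Name: ..., dtype: ..." line
        engine="python",    # required for regex separators and skipfooter
    ).iloc[:, 0]

def merge_records(input_dir, output_csv, pattern="*.txt"):
    """Combine every matching file in input_dir into one CSV and return the DataFrame."""
    paths = sorted(Path(input_dir).glob(pattern))
    df = pd.DataFrame([read_record(p) for p in paths]).reset_index(drop=True)
    df.to_csv(output_csv, index=False)
    return df
```

You would call it as something like `merge_records("records", "dataset.csv")`, where both names are hypothetical. Note that `sorted` orders the paths alphabetically, so `file10.txt` comes before `file2.txt`; that only matters if the row order is important to you.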
CodePudding user response:
You can do this with plain Python, using something like:
data = """ID 0x4607
Delivery_person_ID INDORES13DEL02
Delivery_person_Age 37.000000
Delivery_person_Ratings 4.900000
Restaurant_latitude 22.745049
Restaurant_longitude 75.892471
Delivery_location_latitude 22.765049
Delivery_location_longitude 75.912471
Order_Date 19-03-2022
Time_Orderd 11:30
Time_Order_picked 11:45
Weather conditions Sunny
Road_traffic_density High
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle motorcycle
multiple_deliveries 0.000000
Festival No
City Urban
Time_taken (min) 24.000000"""
for line in data.split('\n'):
    content = line.split()
    # Everything before the last token is the field name; the last token is the value.
    name = ' '.join(content[:-1])
    value = content[-1]
    print(name, value)
Once you have the name and the value, you can add them to a pandas DataFrame.
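As a sketch of that last step, one option is to collect each record into a dict and build the DataFrame from a list of dicts. The helper name `parse_record` and the shortened sample are mine, not from the question. It assumes the value is always the last whitespace-separated token on the line, so a value that itself contains a space would be split in the wrong place.

```python
import pandas as pd

def parse_record(text):
    """Turn 'name value' lines into a dict; the last token on each line is the value."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "Name: ..., dtype: ..." footer of a printed Series.
        if not line or line.startswith("Name:"):
            continue
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2:
            continue  # no value on this line, nothing to store
        name, value = parts
        record[name] = value
    return record

# Shortened sample record, for illustration only.
sample = """ID 0x4607
Delivery_person_Age 37.000000
Weather conditions Sunny
Time_taken (min) 24.000000
Name: 0, dtype: object"""

records = [parse_record(sample)]  # in practice: one dict per file
df = pd.DataFrame(records)

# Every value arrives as a string; convert the columns known to be numeric.
numeric_cols = ["Delivery_person_Age", "Time_taken (min)"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
print(df)
```

With the real files you would read each file's text, append `parse_record(text)` to `records`, and finish with `df.to_csv(...)`.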