I need to create a pandas dataframe based on 4 txt files with comments (to skip while reading) based on the following structure:
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Tondeuse
# Propriétés générales
hauteur=0.5
masse=20.0
prix=110.00
# Propriétés du moteur
impulsion specifique=80
and
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Civic VTEC
# Propriétés générales
hauteur=2.0
masse=3000.0
prix=2968.00
# Propriétés du moteur
impulsion specifique=205
and
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=VelociRAPTOR
# Propriétés générales
hauteur=4.0
masse=2000.0
prix=6000.00
# Propriétés du moteur
impulsion specifique=250
and
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=La Puissance
# Propriétés générales
hauteur=12.0
masse=15000.0
prix=39000.00
# Propriétés du moteur
impulsion specifique=295
That's the result I need to have:
nom hauteur masse prix impulsion specifique
0 Tondeuse 0.5 20.0 110.0 80
1 Civic VTEC 2.0 3000.0 2968.0 205
2 VelociRAPTOR 4.0 2000.0 6000.0 250
3 La Puissance 12.0 15000.0 39000.0 295
I don't know if it's possible, but that's what i was asked to do
CodePudding user response:
welcome to Stackoverflow! :)
If your txt files have their content like you just showed, you could read them in using pandas as a CSV file.
The pandas.read_csv function has multiple things that will help you here:
- It outputs a dataframe, which is the format you would like to end up with
- Has a
comment
input argument, with which you can define lines that are to be ignored - You can use the
=
sign as a separator, which will make you able to split up your data in the wanted sections
Now, let's try to read one of your files using the read_csv
function:
import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None)
df
nom Tondeuse
0 hauteur 0.5
1 masse 20.0
2 prix 110.0
3 impulsion specifique 80.0
We're not completely there yet. We want to remove that index column that gives no info, and we want to transpose the dataframe (rows <-> columns) to be able to concatenate all dataframes together. Let's do it!
import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
df
0 nom hauteur masse prix impulsion specifique
1 Tondeuse 0.5 20.0 110.00 80
That's looking way better! Putting index_col=0
makes the lefternmost column be the index column, and the .T
at the end transposes your dataframe. Now we just need to put this inside of a loop and make a complete script out of it!
import pandas as pd
import glob
import os
files = glob.glob(os.path.join(path, '*.csv'))
all_dfs = []
for file in files:
current_df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
all_dfs.append(current_df)
total_df = pd.concat(all_dfs)
total_df
0 nom hauteur masse prix impulsion specifique
1 La Puissance 12.0 15000.0 39000.00 295
1 Civic VTEC 2.0 3000.0 2968.00 205
1 VelociRAPTOR 4.0 2000.0 6000.00 250
1 Tondeuse 0.5 20.0 110.00 80
Notice that you still have that lefternmost column with the index number, I did not clean it out because I wasn't sure of what you wanted there.
Also, importantly, you need to be aware that if there is a slight difference in the names of the columns in your files (e.g. impulsion specifique
vs impulsion spécifique
) this will bring errors. You will need to create error handling procedures for these. Or maybe enforcing a certain schema, but that is out of the scope of this question.
I hope this helps!
CodePudding user response:
Your data files look very close to configuration files. You can use configparser to generate a dictionary from each file:
from pathlib import Path
from configparser import ConfigParser
data = []
for file in Path("data").glob("*.txt"):
parser = ConfigParser()
# INI file requires a section header. Yours don't have one.
# So let's give it one called DEFAULT
parser.read_string("[DEFAULT]\n" file.read_text())
data.append(dict(parser.items("DEFAULT")))
df = pd.DataFrame(data)