I am trying to read a CSV file and split each row into 2 columns. I tried regex separators such as (?<=_.*)\s, but Python returns "re.error: look-behind requires fixed-width pattern". Other variants, like \s (?![^_\S ]), give more than 2 columns.
Could someone help me find a solution?
pd.read_csv('out.txt', header=None, sep=r"(?<=_.*)\s ", skiprows=2, engine='python', keep_default_na=False)
_journal_issue 23
_journal_name_full 'Physical Chemistry and Chemical Physics'
_journal_page_first 9197
_journal_page_last 9204
_journal_paper_doi 10.1039/c3cp50853f
_journal_volume 15
_journal_year 2013
_chemical_compound_source 'Corrosion product'
_chemical_formula_structural 'Fe3 O4'
_chemical_formula_sum 'Fe3 O4'
_chemical_name_mineral Magnetite
_chemical_name_systematic 'Iron diiron(III) oxide'
_space_group_crystal_system cubic
_space_group_IT_number 227
_space_group_name_Hall 'F 4d 2 3 -1d'
_space_group_name_H-M_alt 'F d -3 m :1'
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90
_cell_formula_units_Z 8
_cell_length_a 8.36
_cell_length_b 8.36
_cell_length_c 8.36
_raman_determination_method experimental
_[local]_chemical_compound_color Black
_[local]_chemical_compound_state Solid
_raman_measurement_device.location 'IMMM Maine university'
_raman_measurement_device.company 'HORIBA Jobin Yvon'
_raman_measurement_device.model T64000
_raman_measurement_device.optics_type microscope
_raman_measurement_device.microscope_system dispersive
_raman_measurement_device.microscope_objective_magnification 100
_raman_measurement_device.microscope_numerical_aperture 0.90
_raman_measurement_device.excitation_laser_type Argon-Krypton
_raman_measurement_device.excitation_laser_wavelength 514
_raman_measurement_device.configuration simple
_raman_measurement_device.resolution 3
_raman_measurement_device.power_on_sample 2
_raman_measurement_device.direction_polarization unoriented
_raman_measurement_device.spot_size 0.8
_raman_measurement_device.diffraction_grating 600
_raman_measurement.environment air
_raman_measurement.environment_details
_raman_measurement.temperature 300
_raman_measurement.pressure 100
_raman_measurement.background_subtraction no
_raman_measurement.background_subtraction_details
_raman_measurement.baseline_correction no
_raman_measurement.baseline_correction_details
CodePudding user response:
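Try splitting on whitespace and treating the single quote as the quote character:
import pandas as pd

df = pd.read_csv(
    "out.txt", header=None, delim_whitespace=True, quotechar="'", keep_default_na=False
)
# Note: delim_whitespace is deprecated since pandas 2.2; sep=r"\s+" is the documented replacement.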
Result for your sample:
    0                                                  1
0   _journal_issue                                     23
1   _journal_name_full                                 Physical Chemistry and Chemical Physics
2   _journal_page_first                                9197
...
46  _raman_measurement.background_subtraction_details
47  _raman_measurement.baseline_correction             no
48  _raman_measurement.baseline_correction_details
CodePudding user response:
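According to the pandas documentation for pd.read_csv, sep must be a string. The r"<String>" form is just a raw string literal. With engine='python', a separator longer than one character is interpreted as a regular expression by Python's re module, and re only accepts fixed-width look-behind patterns, which is why (?<=_.*) is rejected.
What I would recommend is to first loop through the file, replace the delimiter on each line with one common delimiter, and then feed the result to pandas read_csv.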
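A minimal sketch of that preprocessing idea, assuming every line is a key followed by an optional value, and that multi-word values are wrapped in single quotes as in the sample. The helper names split_key_value and parse_pairs are made up for illustration, and only the standard library is used here:

```python
import io


def split_key_value(line):
    """Split one line into (key, value) at the first run of whitespace."""
    parts = line.strip().split(maxsplit=1)
    key = parts[0]
    # Lines that carry only a key get an empty value.
    value = parts[1] if len(parts) > 1 else ""
    # Drop the surrounding single quotes used for multi-word values.
    if len(value) >= 2 and value[0] == value[-1] == "'":
        value = value[1:-1]
    return key, value


def parse_pairs(lines):
    """Return a list of (key, value) tuples, skipping blank lines."""
    return [split_key_value(line) for line in lines if line.strip()]


# A few lines from the sample stand in for the real file here;
# with the real data, pass an open file object instead.
sample = io.StringIO(
    "_journal_issue 23\n"
    "_chemical_formula_structural 'Fe3 O4'\n"
    "_space_group_name_H-M_alt 'F d -3 m :1'\n"
    "_raman_measurement.environment_details\n"
)
rows = parse_pairs(sample)
for key, value in rows:
    print(f"{key}\t{value}")
```

Because the split happens only once per line, spaces inside a value never produce a third column. The resulting list of tuples can be passed to pd.DataFrame(rows, columns=["key", "value"]), or written out tab-separated and read back with sep="\t".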