I've got a problem creating a dataframe from a ascii file with blank spaces.
The raw data has the following format.
Creating a csv-file with the delimiter \s works. But I want the blank spaces as NaN. With my actual script, blank spaces are ignored.
I already tried to replace the blank spaces but that did not work.
The reason why I need these NaN is to merge every second row to the row above. For that, I split the data frame into two. Afterwards I renamed the columns of the second df and after that I merged the two frames back together. Therefore the two dataframes should have the same format.
The print-lines are only for reference in my console and will be deleted in the final version.
Raw data
I set the data in code format to show the original format.
10 N0496 Position 70.990 0.600 71.123 0.268 ***---
142.10 22.920 22.936
11 N0497 Position 71.100 0.600 71.421 0.650 |--->>
142.11 47.750 47.802 0.050
12 N0498 Position 40.820 0.600 40.827 0.151 **----
142.12 41.410 41.335
101 N0501 Durchm. 2.000 0.500 2.004 0.004 --****-----
140.1 -0.090
102 N0502 Durchm. 2.000 0.500 2.000 0.000 --****-----
140.2 -0.090
103 N0503 Durchm. 2.000 0.500 1.930 -0.070 ******-----
140.3 -0.090
104 N0504 Durchm. 2.000 0.500 1.903 -0.097 <<--- -----
140.4 -0.090 -0.007
Code:
import os
import pandas as pd
import numpy as np
input = "C:\\Users\\user\\Desktop\\Messprotokolle\\" #input files
output = "C:\\Users\\user\\Desktop\\CSVFiles1\\" #output files
#select only asc files
os.chdir(input)
asc_files = os.listdir('.')
for asc_file in (asc_files):
if asc_file.endswith(".asc"): #only for .asc
asc_df = pd.read_csv(asc_file, sep = '\s ',
names=['measurement_point','specified_value2', 'measurement_value2','D','E','F','G','H'])
asc_df.replace(r'\s ', np.nan, regex=True)
#print(asc_df)
asc_df.to_csv(output asc_file '.csv')
#formatting_ASC
os.chdir(output)
csv_files = os.listdir('.')
for csv_file in (csv_files):
if csv_file.endswith(".asc.csv"):
df = pd.read_csv(csv_file)
#print (df)
#keep_col = ['measurement_point','specified_value2', 'measurement_value1', 'D', 'E','F','G']
new_df = df#[keep_col]
#print (new_df)
new_df = new_df[~new_df['measurement_point'].isin(['**Teil'])] #removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**T'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**KS-Oben'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**KS-Unten'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**N'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**ME1'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**ME2/3'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**ME5'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**ME8'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**Punkte'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**Punkte unten (1,3,5,6,7,9,11,13,16,18,19.5,'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['**21,23,45,27,29,34,36,38,41,43,44,46,48)'])]#removing unwanted strings
new_df = new_df[~new_df['measurement_point'].isin(['XXXXX'])]#removing unwanted strings
new_df = new_df.reset_index(drop=True)
print(new_df)
new_df.to_csv(output csv_file)
df1 = new_df[new_df.index % 2 ==1]#splitting the original frame into two
df2 = new_df[new_df.index % 2 ==0]
dict2 = {'measurement_point':'index', 'specified_value2':'programm_line', 'measurement_value2':'type', 'D':'specified_value1', 'E':'tolerance_value1_upper', 'F':'measurement_value1','G':'deviation_value1'}
df2.rename(columns=dict2,inplace=True)
print (df1)
print (df2)
right = df1.reset_index(drop=True)
left = df2.reset_index(drop=True)
#print(right)
#print(left)
merge_df = pd.merge(left, right, on=right.index)
merge_df.index = merge_df.index 1
#print (merge_df)
keep_col1= ['measurement_point','specified_value2', 'measurement_value2','type','specified_value1','tolerance_value1_upper','measurement_value1','deviation_value1',]
final_df = merge_df[keep_col1]
#final_df.to_csv(output csv_file)
Output format
66,68,10,N0496,Position,70.990,0.600,71.123,0.268,***---
67,69,142.10,22.920,22.936,,,,,
68,70,11,N0497,Position,71.100,0.600,71.421,0.650,|--->>
69,71,142.11,47.750,47.802,0.050,,,,
70,72,12,N0498,Position,40.820,0.600,40.827,0.151,**----
71,73,142.12,41.410,41.335,,,,,
72,74,101,N0501,Durchm.,2.000,0.500,2.004,0.004,--****-----
73,75,140.1,-0.090,,,,,,
74,76,102,N0502,Durchm.,2.000,0.500,2.000,0.000,--****-----
75,77,140.2,-0.090,,,,,,
76,78,103,N0503,Durchm.,2.000,0.500,1.930,-0.070,******-----
77,79,140.3,-0.090,,,,,,
78,80,104,N0504,Durchm.,2.000,0.500,1.903,-0.097,<<--- -----
79,81,140.4,-0.090,-0.007,,,,,
Desired output format
66,68,10,N0496,Position,70.990,0.600,71.123,0.268,***---
67,69,,,142.10,22.920,,22.936,,
68,70,11,N0497,Position,71.100,0.600,71.421,0.650,|--->>
69,71,,,142.11,47.750,47.802,,0.050,,
70,72,12,N0498,Position,40.820,0.600,40.827,0.151,**----
71,73,,,142.12,41.410,,41.335,,
72,74,101,N0501,Durchm.,2.000,0.500,2.004,0.004,--****-----
73,75,,,140.1,-0.090,,,,
74,76,102,N0502,Durchm.,2.000,0.500,2.000,0.000,--****-----
75,77,,,140.2,-0.090,,,,
76,78,103,N0503,Durchm.,2.000,0.500,1.930,-0.070,******-----
77,79,,,140.3,-0.090,,,,
78,80,104,N0504,Durchm.,2.000,0.500,1.903,-0.097,<<--- -----
79,81,,,140.4,-0.090,-0.007,,,
I know that it is a very specific problem but I can't solve it by myself. I highly appreciate some help. Thanks!
When using (' ') as delimiter, I get the following output
40,,,8,N0481,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.6860000000000004,,,,,-0.064,-----***---,,,,,,,
41,,,,,,,,,,,,,,,,,,139.8,,,,,,,,,,,,,,,,-0.200,,,,,,,
42,,,9,N0482,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.668,,,,,-0.082,-----**----,,,,,,,
43,,,,,,,,,,,,,,,,,,139.9,,,,,,,,,,,,,,,,-0.200,,,,,,,
44,,10,N0483,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.6860000000000004,,,,,-0.064,-----***---,,,,,,,,
45,,,,,,,,,,,,,,,,,139.1,,,,,,,,,,,,,,,,-0.200,,,,,,,,
46,,11,N0484,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.66,,,,,-0.090,-----**----,,,,,,,,
47,,,,,,,,,,,,,,,,,139.11,,,,,,,,,,,,,,,,-0.200,,,,,,,,
There I can't name the columns because of the first number.
CodePudding user response:
Try to replace it with the empty string (''
):
asc_df.replace(r'\s ', '', regex=True)
CodePudding user response:
I solved the problem myself. I mistakenly thought, that the original file was separated with \s
. But the file is a fixed-width file and by reading in the file with pd.read_fwf
I got the right format.