Creating a data frame with blank spaces-CodePudding

I've got a problem creating a dataframe from a ascii file with blank spaces.

The raw data has the following format.

Creating a csv-file with the delimiter \s works. But I want the blank spaces as NaN. With my actual script, blank spaces are ignored.

I already tried to replace the blank spaces but that did not work.

The reason why I need these NaN is to merge every second row to the row above. For that, I split the data frame into two. Afterwards I renamed the columns of the second df and after that I merged the two frames back together. Therefore the two dataframes should have the same format.

The print-lines are only for reference in my console and will be deleted in the final version.

Raw data

I set the data in code format to show the original format.

 10 N0496 Position         70.990      0.600     71.123      0.268      ***---
                142.10     22.920                22.936                     
 11 N0497 Position         71.100      0.600     71.421      0.650      |--->>
                142.11     47.750                47.802      0.050          
 12 N0498 Position         40.820      0.600     40.827      0.151      **----
                142.12     41.410                41.335                     
101 N0501 Durchm.           2.000      0.500      2.004      0.004 --****-----
                 140.1                -0.090                      
102 N0502 Durchm.           2.000      0.500      2.000      0.000 --****-----
                 140.2                -0.090                      
103 N0503 Durchm.           2.000      0.500      1.930     -0.070 ******-----
                 140.3                -0.090                      
104 N0504 Durchm.           2.000      0.500      1.903     -0.097 <<--- -----
                 140.4                -0.090                -0.007

Code:

import os
import pandas as pd
import numpy as np

input = "C:\\Users\\user\\Desktop\\Messprotokolle\\" #input files
output = "C:\\Users\\user\\Desktop\\CSVFiles1\\" #output files

#select only asc files
os.chdir(input)
asc_files = os.listdir('.') 
for asc_file in (asc_files):
    if asc_file.endswith(".asc"): #only for .asc 
            asc_df = pd.read_csv(asc_file, sep = '\s ', 
             names=['measurement_point','specified_value2', 'measurement_value2','D','E','F','G','H'])
            asc_df.replace(r'\s ', np.nan, regex=True)
            #print(asc_df)
            asc_df.to_csv(output asc_file '.csv') 
#formatting_ASC
os.chdir(output)
csv_files = os.listdir('.')
for csv_file in (csv_files):
        if csv_file.endswith(".asc.csv"):
            df = pd.read_csv(csv_file)
            #print (df)
            #keep_col = ['measurement_point','specified_value2', 'measurement_value1', 'D', 'E','F','G']
            new_df = df#[keep_col]
            #print (new_df)
            new_df = new_df[~new_df['measurement_point'].isin(['**Teil'])] #removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**T'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**KS-Oben'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**KS-Unten'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**N'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**ME1'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**ME2/3'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**ME5'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**ME8'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**Punkte'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**Punkte unten (1,3,5,6,7,9,11,13,16,18,19.5,'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['**21,23,45,27,29,34,36,38,41,43,44,46,48)'])]#removing unwanted strings
            new_df = new_df[~new_df['measurement_point'].isin(['XXXXX'])]#removing unwanted strings
            new_df = new_df.reset_index(drop=True)
            print(new_df)   
            new_df.to_csv(output csv_file)   
            df1 = new_df[new_df.index % 2 ==1]#splitting the original frame into two
            df2 = new_df[new_df.index % 2 ==0]
            dict2 = {'measurement_point':'index', 'specified_value2':'programm_line', 'measurement_value2':'type', 'D':'specified_value1', 'E':'tolerance_value1_upper', 'F':'measurement_value1','G':'deviation_value1'}
            df2.rename(columns=dict2,inplace=True)
            print (df1)
            print (df2)
            right = df1.reset_index(drop=True)
            left = df2.reset_index(drop=True)
            #print(right)
            #print(left)
            merge_df = pd.merge(left, right, on=right.index)
            merge_df.index = merge_df.index   1
            #print (merge_df)
            keep_col1= ['measurement_point','specified_value2', 'measurement_value2','type','specified_value1','tolerance_value1_upper','measurement_value1','deviation_value1',]
            final_df = merge_df[keep_col1]
            #final_df.to_csv(output csv_file)

Output format

66,68,10,N0496,Position,70.990,0.600,71.123,0.268,***---
67,69,142.10,22.920,22.936,,,,,
68,70,11,N0497,Position,71.100,0.600,71.421,0.650,|--->>
69,71,142.11,47.750,47.802,0.050,,,,
70,72,12,N0498,Position,40.820,0.600,40.827,0.151,**----
71,73,142.12,41.410,41.335,,,,,
72,74,101,N0501,Durchm.,2.000,0.500,2.004,0.004,--****-----
73,75,140.1,-0.090,,,,,,
74,76,102,N0502,Durchm.,2.000,0.500,2.000,0.000,--****-----
75,77,140.2,-0.090,,,,,,
76,78,103,N0503,Durchm.,2.000,0.500,1.930,-0.070,******-----
77,79,140.3,-0.090,,,,,,
78,80,104,N0504,Durchm.,2.000,0.500,1.903,-0.097,<<--- -----
79,81,140.4,-0.090,-0.007,,,,,

Desired output format

66,68,10,N0496,Position,70.990,0.600,71.123,0.268,***---
67,69,,,142.10,22.920,,22.936,,
68,70,11,N0497,Position,71.100,0.600,71.421,0.650,|--->>
69,71,,,142.11,47.750,47.802,,0.050,,
70,72,12,N0498,Position,40.820,0.600,40.827,0.151,**----
71,73,,,142.12,41.410,,41.335,,
72,74,101,N0501,Durchm.,2.000,0.500,2.004,0.004,--****-----
73,75,,,140.1,-0.090,,,,
74,76,102,N0502,Durchm.,2.000,0.500,2.000,0.000,--****-----
75,77,,,140.2,-0.090,,,,
76,78,103,N0503,Durchm.,2.000,0.500,1.930,-0.070,******-----
77,79,,,140.3,-0.090,,,,
78,80,104,N0504,Durchm.,2.000,0.500,1.903,-0.097,<<--- -----
79,81,,,140.4,-0.090,-0.007,,,

I know that it is a very specific problem but I can't solve it by myself. I highly appreciate some help. Thanks!

When using (' ') as delimiter, I get the following output

40,,,8,N0481,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.6860000000000004,,,,,-0.064,-----***---,,,,,,,
41,,,,,,,,,,,,,,,,,,139.8,,,,,,,,,,,,,,,,-0.200,,,,,,,
42,,,9,N0482,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.668,,,,,-0.082,-----**----,,,,,,,
43,,,,,,,,,,,,,,,,,,139.9,,,,,,,,,,,,,,,,-0.200,,,,,,,
44,,10,N0483,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.6860000000000004,,,,,-0.064,-----***---,,,,,,,,
45,,,,,,,,,,,,,,,,,139.1,,,,,,,,,,,,,,,,-0.200,,,,,,,,
46,,11,N0484,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.66,,,,,-0.090,-----**----,,,,,,,,
47,,,,,,,,,,,,,,,,,139.11,,,,,,,,,,,,,,,,-0.200,,,,,,,,

There I can't name the columns because of the first number.

CodePudding user response：

Try to replace it with the empty string (''):

asc_df.replace(r'\s ', '', regex=True)

CodePudding user response：

I solved the problem myself. I mistakenly thought, that the original file was separated with \s . But the file is a fixed-width file and by reading in the file with pd.read_fwf I got the right format.