Insert header row for csv file in Python-CodePudding

I tried to insert header row for csv file that contains 35M rows using the following segment of code


    import csv
    
    with open('E:\\Dataset\\dataset1.csv') as infile:
        text = infile.read()
    header = ['User IP','Top-level domain', 'Timestamp', 'Is Attack', 'Request',
                  'Len(request) withou TLD', 'Subdomains_count', 'w_count', 'w_max',
                  'entropy', 'w_max_ratio', 'w_count_ratio', 'digits_ratio', 'uppercase_ratio',
                  'time_avg', 'time_stdev', 'size_avg', 'size stdev', 'throughput', 'unique', 'entropy_avg'
                  'entropy_stdev']
    
    with open('E:\\Dataset\\dataset2.csv', 'w') as outfile:
        # join the headers into a string with commas and add a newline
        outfile.write(f"{','.join(header)}\n") 
        outfile.write(text)

However, when I tried to print data with the headerو the headers shifted one column. this is the original head of data (without header)

   186.169.253.58     surbl.org  1624438272607  False  \
0  186.169.253.58     surbl.org  1624438272607  False   
1  186.169.253.58  spamhaus.org  1624438273058  False   
2  186.169.253.58  spamhaus.org  1624438273058  False   
3  186.169.253.58  spamhaus.org  1624438273059  False   
4  186.169.253.58  spamhaus.org  1624438273059  False   

                       h.surbl.org   1  1.1  0  0.1      -0.0       0.0  \
0                      f.surbl.org   1    1  0    0 -0.000000  0.000000   
1  118.141.11.106.sbl.spamhaus.org  18    5  0    0  2.633731  0.000000   
2  118.141.11.106.zen.spamhaus.org  18    5  1    3  2.633731  0.166667   
3  128.141.11.106.sbl.spamhaus.org  18    5  0    0  2.863826  0.000000   
4  128.141.11.106.zen.spamhaus.org  18    5  1    3  2.863826  0.166667   

      0.0.1     0.0.2  0.0.3  3.4444444444444446  9.59311095410544   1.5  \
0  0.000000  0.000000    0.0            0.222222          0.440959   1.0   
1  0.000000  0.611111    0.0           55.555556        165.542375  17.2   
2  0.055556  0.611111    0.0            0.333333          0.500000  17.2   
3  0.000000  0.611111    0.0            0.333333          0.500000  17.3   
4  0.055556  0.611111    0.0            0.333333          0.500000  17.4   

   1.5811388300841898        468.75  0.4444444444444444  0.25849625007211563  \
0            0.000000   3333.333333            0.555556             0.000000   
1            0.421637    343.313373            0.000000             3.048277   
2            0.421637  43000.000000            0.000000             2.983547   
3            0.483046  43250.000000            0.000000             2.959741   
4            0.516398  43500.000000            0.000000             2.935936   

   0.81743691684035  
0          0.000000  
1          0.177285  
2          0.199622  
3          0.198131  
4          0.193400

and this is the head of data after adding header

                     User IP  Top-level domain  Timestamp  \
186.169.253.58     surbl.org     1624438272607      False   
186.169.253.58     surbl.org     1624438272607      False   
186.169.253.58  spamhaus.org     1624438273058      False   
186.169.253.58  spamhaus.org     1624438273058      False   
186.169.253.58  spamhaus.org     1624438273059      False   

                                      Is Attack  Request  \
186.169.253.58                      h.surbl.org        1   
186.169.253.58                      f.surbl.org        1   
186.169.253.58  118.141.11.106.sbl.spamhaus.org       18   
186.169.253.58  118.141.11.106.zen.spamhaus.org       18   
186.169.253.58  128.141.11.106.sbl.spamhaus.org       18   

                Len(request) withou TLD  Subdomains_count  w_count     w_max  \
186.169.253.58                        1                 0        0 -0.000000   
186.169.253.58                        1                 0        0 -0.000000   
186.169.253.58                        5                 0        0  2.633731   
186.169.253.58                        5                 1        3  2.633731   
186.169.253.58                        5                 0        0  2.863826   

                 entropy  w_max_ratio  w_count_ratio  digits_ratio  \
186.169.253.58  0.000000     0.000000       0.000000           0.0   
186.169.253.58  0.000000     0.000000       0.000000           0.0   
186.169.253.58  0.000000     0.000000       0.611111           0.0   
186.169.253.58  0.166667     0.055556       0.611111           0.0   
186.169.253.58  0.000000     0.000000       0.611111           0.0   

                uppercase_ratio    time_avg  time_stdev  size_avg  \
186.169.253.58         3.444444    9.593111         1.5  1.581139   
186.169.253.58         0.222222    0.440959         1.0  0.000000   
186.169.253.58        55.555556  165.542375        17.2  0.421637   
186.169.253.58         0.333333    0.500000        17.2  0.421637   
186.169.253.58         0.333333    0.500000        17.3  0.483046   

                  size stdev  throughput    unique  entropy_avgentropy_stdev  
186.169.253.58    468.750000    0.444444  0.258496                  0.817437  
186.169.253.58   3333.333333    0.555556  0.000000                  0.000000  
186.169.253.58    343.313373    0.000000  3.048277                  0.177285  
186.169.253.58  43000.000000    0.000000  2.983547                  0.199622  
186.169.253.58  43250.000000    0.000000  2.959741                  0.198131

it looks like using the first column as index.

CodePudding user response：

If you are able to leverage it, use Pandas, as the header issues you're fighting are addressed by the explicit import and export flags available.

import pandas

header = ['User IP', 'Top-level domain', 'Timestamp', 'Is Attack', 'Request',
          'Len(request) withou TLD', 'Subdomains_count', 'w_count', 'w_max',
          'entropy', 'w_max_ratio', 'w_count_ratio', 'digits_ratio', 'uppercase_ratio',
          'time_avg', 'time_stdev', 'size_avg', 'size stdev', 'throughput', 'unique', 'entropy_avg',
                                                                                      'entropy_stdev']
pandas.read_csv('E:\\Dataset\\dataset1.csv', names=header, index_col=False)
pandas.to_csv('E:\\Dataset\\dataset2.csv', header=True, index=False)

It's unclear in your question if you want the indexes in the output, or if they were present in the input. If they're present in the input, set index_col=0. If you want them in the output, set index=True in the to_csv call.