I tried to insert header row for csv file that contains 35M rows using the following segment of code
import csv
with open('E:\\Dataset\\dataset1.csv') as infile:
text = infile.read()
header = ['User IP','Top-level domain', 'Timestamp', 'Is Attack', 'Request',
'Len(request) withou TLD', 'Subdomains_count', 'w_count', 'w_max',
'entropy', 'w_max_ratio', 'w_count_ratio', 'digits_ratio', 'uppercase_ratio',
'time_avg', 'time_stdev', 'size_avg', 'size stdev', 'throughput', 'unique', 'entropy_avg'
'entropy_stdev']
with open('E:\\Dataset\\dataset2.csv', 'w') as outfile:
# join the headers into a string with commas and add a newline
outfile.write(f"{','.join(header)}\n")
outfile.write(text)
However, when I tried to print data with the headerو the headers shifted one column. this is the original head of data (without header)
186.169.253.58 surbl.org 1624438272607 False \
0 186.169.253.58 surbl.org 1624438272607 False
1 186.169.253.58 spamhaus.org 1624438273058 False
2 186.169.253.58 spamhaus.org 1624438273058 False
3 186.169.253.58 spamhaus.org 1624438273059 False
4 186.169.253.58 spamhaus.org 1624438273059 False
h.surbl.org 1 1.1 0 0.1 -0.0 0.0 \
0 f.surbl.org 1 1 0 0 -0.000000 0.000000
1 118.141.11.106.sbl.spamhaus.org 18 5 0 0 2.633731 0.000000
2 118.141.11.106.zen.spamhaus.org 18 5 1 3 2.633731 0.166667
3 128.141.11.106.sbl.spamhaus.org 18 5 0 0 2.863826 0.000000
4 128.141.11.106.zen.spamhaus.org 18 5 1 3 2.863826 0.166667
0.0.1 0.0.2 0.0.3 3.4444444444444446 9.59311095410544 1.5 \
0 0.000000 0.000000 0.0 0.222222 0.440959 1.0
1 0.000000 0.611111 0.0 55.555556 165.542375 17.2
2 0.055556 0.611111 0.0 0.333333 0.500000 17.2
3 0.000000 0.611111 0.0 0.333333 0.500000 17.3
4 0.055556 0.611111 0.0 0.333333 0.500000 17.4
1.5811388300841898 468.75 0.4444444444444444 0.25849625007211563 \
0 0.000000 3333.333333 0.555556 0.000000
1 0.421637 343.313373 0.000000 3.048277
2 0.421637 43000.000000 0.000000 2.983547
3 0.483046 43250.000000 0.000000 2.959741
4 0.516398 43500.000000 0.000000 2.935936
0.81743691684035
0 0.000000
1 0.177285
2 0.199622
3 0.198131
4 0.193400
and this is the head of data after adding header
User IP Top-level domain Timestamp \
186.169.253.58 surbl.org 1624438272607 False
186.169.253.58 surbl.org 1624438272607 False
186.169.253.58 spamhaus.org 1624438273058 False
186.169.253.58 spamhaus.org 1624438273058 False
186.169.253.58 spamhaus.org 1624438273059 False
Is Attack Request \
186.169.253.58 h.surbl.org 1
186.169.253.58 f.surbl.org 1
186.169.253.58 118.141.11.106.sbl.spamhaus.org 18
186.169.253.58 118.141.11.106.zen.spamhaus.org 18
186.169.253.58 128.141.11.106.sbl.spamhaus.org 18
Len(request) withou TLD Subdomains_count w_count w_max \
186.169.253.58 1 0 0 -0.000000
186.169.253.58 1 0 0 -0.000000
186.169.253.58 5 0 0 2.633731
186.169.253.58 5 1 3 2.633731
186.169.253.58 5 0 0 2.863826
entropy w_max_ratio w_count_ratio digits_ratio \
186.169.253.58 0.000000 0.000000 0.000000 0.0
186.169.253.58 0.000000 0.000000 0.000000 0.0
186.169.253.58 0.000000 0.000000 0.611111 0.0
186.169.253.58 0.166667 0.055556 0.611111 0.0
186.169.253.58 0.000000 0.000000 0.611111 0.0
uppercase_ratio time_avg time_stdev size_avg \
186.169.253.58 3.444444 9.593111 1.5 1.581139
186.169.253.58 0.222222 0.440959 1.0 0.000000
186.169.253.58 55.555556 165.542375 17.2 0.421637
186.169.253.58 0.333333 0.500000 17.2 0.421637
186.169.253.58 0.333333 0.500000 17.3 0.483046
size stdev throughput unique entropy_avgentropy_stdev
186.169.253.58 468.750000 0.444444 0.258496 0.817437
186.169.253.58 3333.333333 0.555556 0.000000 0.000000
186.169.253.58 343.313373 0.000000 3.048277 0.177285
186.169.253.58 43000.000000 0.000000 2.983547 0.199622
186.169.253.58 43250.000000 0.000000 2.959741 0.198131
it looks like using the first column as index.
CodePudding user response:
If you are able to leverage it, use Pandas, as the header issues you're fighting are addressed by the explicit import and export flags available.
import pandas
header = ['User IP', 'Top-level domain', 'Timestamp', 'Is Attack', 'Request',
'Len(request) withou TLD', 'Subdomains_count', 'w_count', 'w_max',
'entropy', 'w_max_ratio', 'w_count_ratio', 'digits_ratio', 'uppercase_ratio',
'time_avg', 'time_stdev', 'size_avg', 'size stdev', 'throughput', 'unique', 'entropy_avg',
'entropy_stdev']
pandas.read_csv('E:\\Dataset\\dataset1.csv', names=header, index_col=False)
pandas.to_csv('E:\\Dataset\\dataset2.csv', header=True, index=False)
It's unclear in your question if you want the indexes in the output, or if they were present in the input. If they're present in the input, set index_col=0. If you want them in the output, set index=True in the to_csv call.