Home > database >  Parse CSV headers that have singular spaces inbetween header names but larger spaces between columns
Parse CSV headers that have singular spaces inbetween header names but larger spaces between columns

Time:08-31

I am parsing netstat data and my data looks like this

Proto Local Address Foreign Adress State PID Process name
TCP [0.0.0.0:7] 0.0.0.0:0 LISTENING 4112 tcpsvcs.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 4 Can not obtain ownership information

When I try to parse this using this Python command

pd.read_csv("C:\\Users\\xxxxx\\Documents\\test.csv", on_bad_lines='skip', header=1, delim_whitespace= True)

my data ends up looking like this (Notice how Process_name isn't properly filled out and everything is being shifted to the left due to the space between Local address and foreign address):

Proto Local Address Foreign Address State PID Process_name
TCP [0.0.0.0:7] 0.0.0.0:0 LISTENING 4112 tcpsvcs.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 4 Can not obtain ownership information

When it should end up looking like this:

Proto Local Address Foreign Address State PID Process_name
TCP [0.0.0.0:7] 0.0.0.0:0 LISTENING 4112 tcpsvcs.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 4 Can not obtain ownership information

And that's solely due to the spacing of the headers. Right now I am just adding my own headers to my CSV file but the only issue with that is my file is extremely large and appending a header row at the top of a CSV is very heavy. I was wondering if there are any ways to parse the header that I do have and ignore the singular space and only delimit the spaces that are larger than the singular space that I have.

CodePudding user response:

Use sep='\s\s ' on read_csv to separate columns that are separated by more than one space:

Example Data

sim_csv = io.StringIO(
'''  Proto  Local Address          Foreign Address        State           PID    Process_name
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING       4      tcpsvcs.exe   
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       1552   Can not obtain ownership information''')

Execute

df = pd.read_csv(sim_csv, sep='\s\s ')
print(df)

Result

   Proto Local Address Foreign Address      State   PID                          Process_name 
0   TCP    0.0.0.0:80       0.0.0.0:0  LISTENING     4                            tcpsvcs.exe 
1   TCP   0.0.0.0:135       0.0.0.0:0  LISTENING  1552   Can not obtain ownership information 

CodePudding user response:

I cannot test it with the raw file but try this:

pd.read_csv("C:\\Users\\xxxxx\\Documents\\test.csv", on_bad_lines='skip', header=1, sep="\t")
  • Related