Get specific data from txt file to pandas dataframe

I have data like this in a txt file:

Wed Mar 23 16:59:25 GMT 2022
      1 State
      1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
      1 State
      1 ESTAB
      1 CLOSE-WAIT

Wed Mar 23 16:59:27 GMT 2022
      1 State
      1 ESTAB
      10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
      1 State
      1 CLOSE-WAIT
      102 ESTAB

I want to get a pandas dataframe looking like this:

timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1

That is, the first line of each paragraph should go into the first column, timestamp. The other columns should be filled with the numbers, each placed in the column named by the string that follows it. A new row begins after each paragraph.

How can I do this with pandas?

CodePudding user response：

First, process the txt file into a list of lists. Each inner list holds the lines of one block (blocks are separated by a blank line), and the outer list holds all the blocks:

import pandas as pd

with open('data.txt', 'r') as f:
    res = f.read()

# split the file into blocks on blank lines, then each block into stripped lines
records = [list(map(str.strip, block.strip().split('\n'))) for block in res.split('\n\n')]

print(records)

[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
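
Splitting on the literal '\n\n' assumes Unix line endings and truly empty separator lines. If the file might have Windows line endings or separator lines that contain stray spaces, a more tolerant splitter could look like the sketch below. The helper name `split_blocks` and the `sample` string are made up for illustration; they are not part of the original answer.

```python
import re

def split_blocks(text):
    """Split raw text into blocks separated by one or more blank lines."""
    # Normalise Windows line endings and drop leading/trailing whitespace.
    text = text.replace('\r\n', '\n').strip()
    if not text:
        return []
    # A separator is a newline, optional whitespace, then another newline.
    blocks = re.split(r'\n\s*\n', text)
    return [[line.strip() for line in block.split('\n')] for block in blocks]

# Hypothetical sample: CRLF endings and a separator line holding three spaces.
sample = (
    "Wed Mar 23 16:59:25 GMT 2022\r\n"
    "      1 State\r\n"
    "      1 ESTAB\r\n"
    "   \r\n"
    "Wed Mar 23 16:59:26 GMT 2022\r\n"
    "      1 State\r\n"
)
records = split_blocks(sample)
print(records)
```

The result has the same list-of-lists shape as `records` above, so the following steps should work unchanged.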

Then turn the list of lists into a list of dictionaries, defining each key and value by hand:

l = []
for record in records:
    # the first line of each block is the timestamp
    d = {'timestamp': record[0]}
    for r in record[1:]:
        # every remaining line looks like '<count> <name>'
        value, key = r.split()
        d[key] = value

    l.append(d)

print(l)

[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]

Finally, pass this list of dictionaries to the DataFrame constructor and fill the NaN cells with 0:

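# the counts were read as strings, so convert every column except timestamp to int
df = pd.DataFrame(l).set_index('timestamp').astype(float).fillna(0).astype(int).reset_index()

print(df)

                      timestamp  State  ESTAB  CLOSE-WAIT  FIN-WAIT
0  Wed Mar 23 16:59:25 GMT 2022      1      1           0         0
1  Wed Mar 23 16:59:26 GMT 2022      1      1           1         0
2  Wed Mar 23 16:59:27 GMT 2022      1      1           0        10
3  Wed Mar 23 16:59:28 GMT 2022      1    102           1         0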
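
The three steps above can also be folded into one helper that stores the counts as integers from the start. This is only a sketch, assuming every non-timestamp line has the form `<count> <name>` and that blocks are separated by an empty line; `parse_counts` and `sample` are illustrative names, not part of the answer.

```python
import pandas as pd

def parse_counts(text):
    """Turn blank-line separated blocks into one DataFrame row per block."""
    rows = []
    for block in text.strip().split('\n\n'):
        lines = [line.strip() for line in block.split('\n') if line.strip()]
        # The first line of a block is the timestamp.
        row = {'timestamp': lines[0]}
        for line in lines[1:]:
            count, name = line.split(maxsplit=1)
            row[name] = int(count)
        rows.append(row)
    # States missing from a block show up as NaN; replace them with 0.
    df = pd.DataFrame(rows).set_index('timestamp')
    return df.fillna(0).astype(int).reset_index()

# Hypothetical sample taken from the first and last blocks of the question.
sample = """Wed Mar 23 16:59:25 GMT 2022
      1 State
      1 ESTAB

Wed Mar 23 16:59:28 GMT 2022
      1 State
      1 CLOSE-WAIT
      102 ESTAB
"""
df = parse_counts(sample)
print(df.dtypes)
```

With a real file, the text would come from `open('data.txt').read()` as in the first step.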

CodePudding user response：

Try:

import pandas as pd

#read text file to a DataFrame (one column, one row per line; blank lines become NaN)
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)

#Extract possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")

#Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)","",regex=True)

df = df.dropna(how="all").fillna("timestamp")
df["Index"] = df["Column"].eq("timestamp").cumsum()

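#Pivot the data to match expected output structure
#(keyword arguments are required: pivot no longer accepts positional ones since pandas 2.0)
output = df.pivot(index="Index", columns="Column", values=0)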

#Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()

>>> output
Column                     timestamp  CLOSE-WAIT  ESTAB  FIN-WAIT  State
0       Wed Mar 23 16:59:25 GMT 2022           0      1         0      1
1       Wed Mar 23 16:59:26 GMT 2022           1      1         0      1
2       Wed Mar 23 16:59:27 GMT 2022           0      1        10      1
3       Wed Mar 23 16:59:28 GMT 2022           1    102         0      1
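
Note that the pivot sorts the columns alphabetically, and a state that never occurs in the file gets no column at all. If the exact layout from the question is needed, one option is `DataFrame.reindex` with a fixed column list. A small sketch follows; the `wanted` list and the toy `output` frame are assumptions standing in for the real result.

```python
import pandas as pd

# Toy stand-in for a parsed result in which FIN-WAIT never occurred.
output = pd.DataFrame({
    'timestamp': ['Wed Mar 23 16:59:25 GMT 2022', 'Wed Mar 23 16:59:26 GMT 2022'],
    'CLOSE-WAIT': [0, 1],
    'ESTAB': [1, 1],
    'State': [1, 1],
})

# Column order requested in the question.
wanted = ['timestamp', 'State', 'ESTAB', 'FIN-WAIT', 'CLOSE-WAIT']

# reindex reorders existing columns and creates missing ones filled with 0.
ordered = output.reindex(columns=wanted, fill_value=0)
print(ordered)
```

Columns that are not listed in `wanted` are dropped by `reindex`, so the list should name every state that matters.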