I have data like this in a txt file:
Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT

Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB
I want to get a pandas dataframe looking like this:
timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1
That means the string on the first line of each paragraph should be used for the first column, timestamp. The other columns should be filled with the numbers, according to the string following each number. A new row begins after each paragraph.
How can I do this with pandas?
CodePudding user response:
First, you can parse the txt file into a list of lists. Each inner list holds the lines of one paragraph, and the outer list holds all the paragraphs:
import pandas as pd
with open('data.txt', 'r') as f:
    res = f.read()
# Split on blank lines into paragraphs, then split each paragraph into stripped lines
records = [list(map(str.strip, block.strip().split('\n'))) for block in res.split('\n\n')]
print(records)
[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
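The split on '\n\n' assumes exactly one empty line between paragraphs. If the file might contain several blank lines in a row, or blank lines with stray spaces, a regex split is a bit more forgiving. A minimal sketch, where `text` and `split_records` are made-up names for illustration and the sample text stands in for the contents of data.txt:

```python
import re

# Hypothetical sample standing in for data.txt; note the two blank lines
# between the paragraphs, which a plain split on '\n\n' would not handle well.
text = """Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB


Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
"""

def split_records(raw):
    """Return one list of stripped lines per paragraph."""
    # \n\s*\n matches a run of one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", raw.strip())
    return [[line.strip() for line in block.splitlines()] for block in blocks]

records = split_records(text)
print(records)
```

The result has the same list-of-lists shape as `records` above, so the later steps should work unchanged.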
Then you can turn the list of lists into a list of dictionaries by setting each key and value manually:
l = []
for record in records:
    d = {}
    # The first line of each paragraph is the timestamp
    d['timestamp'] = record[0]
    for r in record[1:]:
        # Each remaining line looks like "<count> <name>"
        value, key = r.split(' ', 1)
        d[key] = value
    l.append(d)
print(l)
[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]
Finally, you can pass this list of dictionaries to DataFrame and fill the NaN cells with 0:
df = pd.DataFrame(l).fillna(0)
print(df)
                      timestamp  State  ESTAB  CLOSE-WAIT  FIN-WAIT
0  Wed Mar 23 16:59:25 GMT 2022      1      1           0         0
1  Wed Mar 23 16:59:26 GMT 2022      1      1           1         0
2  Wed Mar 23 16:59:27 GMT 2022      1      1           0        10
3  Wed Mar 23 16:59:28 GMT 2022      1    102           1         0
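One thing to watch: the counts collected in the loop are strings, so after fillna(0) the columns hold a mix of strings and the integer 0. If you need real numbers and real timestamps, something like the following might help. This is only a sketch; the `rows` sample is made up to mirror the list of dictionaries above, and the format string assumes every timestamp contains the literal text GMT:

```python
import pandas as pd

# Hypothetical sample shaped like the list of dictionaries built above
# (counts are still strings at this point).
rows = [
    {"timestamp": "Wed Mar 23 16:59:25 GMT 2022", "State": "1", "ESTAB": "1"},
    {"timestamp": "Wed Mar 23 16:59:27 GMT 2022", "State": "1", "ESTAB": "1", "FIN-WAIT": "10"},
]

df = pd.DataFrame(rows).fillna(0)

# Every column except the timestamp holds a count, so cast those to int.
count_cols = [c for c in df.columns if c != "timestamp"]
df = df.astype({c: int for c in count_cols})

# "GMT" is matched as literal text here, so the result is timezone-naive.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%a %b %d %H:%M:%S GMT %Y")
print(df.dtypes)
```

With integer columns, sums and comparisons behave as expected, and the parsed timestamp can be used as a datetime index if needed.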
CodePudding user response:
Try:
import pandas as pd
# Read the text file into a single-column DataFrame, keeping blank lines as NaN rows
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)
# Extract possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")
# Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)", "", regex=True)
# Drop the blank rows; rows without a column name are the timestamps
df = df.dropna(how="all").fillna("timestamp")
# Number each paragraph by counting the timestamp rows
df["Index"] = df["Column"].eq("timestamp").cumsum()
# Pivot the data to match expected output structure (keyword arguments are required in pandas 2.0+)
output = df.pivot(index="Index", columns="Column", values=0)
# Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()
>>> output
Column                     timestamp  CLOSE-WAIT  ESTAB  FIN-WAIT  State
0       Wed Mar 23 16:59:25 GMT 2022           0      1         0      1
1       Wed Mar 23 16:59:26 GMT 2022           1      1         0      1
2       Wed Mar 23 16:59:27 GMT 2022           0      1        10      1
3       Wed Mar 23 16:59:28 GMT 2022           1    102         0      1
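The pivot sorts the columns alphabetically, which is not the order asked for in the question. If the order matters, reindex can rearrange the columns, and it also creates any state that never showed up in the file. A small sketch, using a made-up `output` frame and a `wanted` list taken from the question's desired header:

```python
import pandas as pd

# Hypothetical frame in the alphabetical order a pivot would give;
# FIN-WAIT is left out on purpose to show how a missing state is handled.
output = pd.DataFrame({
    "timestamp": ["Wed Mar 23 16:59:25 GMT 2022", "Wed Mar 23 16:59:26 GMT 2022"],
    "CLOSE-WAIT": [0, 1],
    "ESTAB": [1, 1],
    "State": [1, 1],
})

# Column order from the question.
wanted = ["timestamp", "State", "ESTAB", "FIN-WAIT", "CLOSE-WAIT"]

# reindex reorders existing columns and adds missing ones filled with 0.
ordered = output.reindex(columns=wanted, fill_value=0)
print(ordered)
```

The same call should work on the frame from the first answer as well, since it only depends on the column names.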