I have a set of texts as below. I want to extract these the table formatted in these plain texts as a pandas dataframe.
What I have:
"data.txt"
Description: Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received: July 01, 1994
Discussion List: [email protected]
To Subscribe to List: [email protected]
General Information: [email protected]
E-mail server: [email protected]
Anonymous FTP: ftp://town.hall.org/edgar/daily-index/form.070194.idx
Form Type Company Name CIK Date Filed File Name
-----------------------------------------------------------------------------------------
10-C ABC. 310254 19940701 file1.txt
10-B XYZ. 121234 54547474 file2.txt
10-A LMN. 346765 12352356 file3.txt
What I want: a pandas dataframe with below structure
Form Type Company Name CIK Date Filed File Name
-------------------------------------------------------------------
| 10-C | ABC | 310254 | 19940701 | file1.txt |
| 10-B | XYZ | 121234 | 54547474 | file2.txt |
| 10-A | LMN | 346765 | 12352356 | file3.txt |
-------------------------------------------------------------------
Below is my code:
test = test.split('\n')
while not re.search('^--*', test[0]): test.pop(0)
test.pop(0)
rows = []
for row in test:
rows.append(row.split())
print(rows)
I find the first occurrence of the dashed line and then append the rows to a list. Later I convert it to a dataframe. However, I believe there must be a cleaner way to do this, which is why I ask your contribution/support.
Thanks and Happy New Year!! :)
CodePudding user response:
Find the empty line that separates the starting extra lines from the needed table structure, then pass the remaining file buffer to pd.read_table
function:
with open('data.txt') as f:
for line in f:
line = line.strip()
if not line: # find empty line
break
df = pd.read_table(f, sep='\s{2,}', header=0, comment='--', engine='python')
print(df)
Output:
Form Type Company Name CIK Date Filed File Name
0 10-C ABC. 310254 19940701 file1.txt
1 10-B XYZ. 121234 54547474 file2.txt
2 10-A LMN. 346765 12352356 file3.txt
CodePudding user response:
we have a method pd.read_fwf(), where you can give parameter to skip records from being part of dataframe like - no of lines to skip initial records blank records
If you know in advance from where data will start then use this function to read file.