How do I extract the table in the below plaintext as a pandas dataframe in python?-CodePudding

I have a set of texts as below. I want to extract these the table formatted in these plain texts as a pandas dataframe.

What I have:

"data.txt"

Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994
Discussion List:       [email protected]
To Subscribe to List:  [email protected]
General Information:   [email protected]
E-mail server:         [email protected]
Anonymous FTP:         ftp://town.hall.org/edgar/daily-index/form.070194.idx

Form Type   Company Name            CIK         Date Filed  File Name 
-----------------------------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt   
10-B        XYZ.                    121234      54547474    file2.txt   
10-A        LMN.                    346765      12352356    file3.txt

What I want: a pandas dataframe with below structure

Form Type   Company Name     CIK          Date Filed   File Name 
-------------------------------------------------------------------
| 10-C      | ABC          | 310254     | 19940701   | file1.txt  | 
| 10-B      | XYZ          | 121234     | 54547474   | file2.txt  |
| 10-A      | LMN          | 346765     | 12352356   | file3.txt  |
-------------------------------------------------------------------

Below is my code:

test = test.split('\n')
while not re.search('^--*', test[0]): test.pop(0)
test.pop(0)

rows = []
for row in test:
  rows.append(row.split())

print(rows)

I find the first occurrence of the dashed line and then append the rows to a list. Later I convert it to a dataframe. However, I believe there must be a cleaner way to do this, which is why I ask your contribution/support.

Thanks and Happy New Year!! :)

CodePudding user response：

Find the empty line that separates the starting extra lines from the needed table structure, then pass the remaining file buffer to pd.read_table function:

with open('data.txt') as f:
     for line in f:
        line = line.strip()
        if not line:  # find empty line
            break
     df = pd.read_table(f, sep='\s{2,}', header=0, comment='--', engine='python')
     print(df)

Output:

  Form Type Company Name     CIK  Date Filed  File Name
0      10-C         ABC.  310254    19940701  file1.txt
1      10-B         XYZ.  121234    54547474  file2.txt
2      10-A         LMN.  346765    12352356  file3.txt

CodePudding user response：

we have a method pd.read_fwf(), where you can give parameter to skip records from being part of dataframe like - no of lines to skip initial records blank records

If you know in advance from where data will start then use this function to read file.