Parsing idx file or string to Pandas DataFrame-CodePudding

I would like to parse the following idx file:

Next I would like to build a parser that is fault tollerance, because it should parse daily a new idx file into pd.DataFrame.

My idea was to use string manipulation, but it would be very complicated and not fault tollerance.

I would be thankful if someone can show the best practice to parse and give a boilerplate code.

CodePudding user response：

Since this is mostly a fixed width file you could use pandas read_fwf to read this file. You can skip over the leading information (via skiprows=) and get straight to the data. The column names are predefined and assigned when read:

idx_path = 'company.20220112.idx'
names = ['Company Name','Form Type','CIK','Date Filed','File Name']
df = pd.read_fwf(idx_path, colspecs=[(0,61),(62,74),(74,84),(86,94),(98,146)], names=names, skiprows=11)
df.head(10)

                                         Company Name Form Type      CIK Date Filed                                    File Name  
0   005 - Series of IPOSharks Venture Master Fund,...         D  1888451   20220112  edgar/data/1888451/0001888451-22-000002.txt  
1           10X Capital Venture Acquisition Corp. III    EFFECT  1848948   20220111  edgar/data/1848948/9999999995-22-000102.txt  
2                              110 White Partners LLC         D  1903845   20220112  edgar/data/1903845/0001884293-22-000001.txt  
3                                       15 Beach, MHC         3  1903509   20220112  edgar/data/1903509/0001567619-22-001073.txt  
4                                       15 Beach, MHC    SC 13D  1903509   20220112  edgar/data/1903509/0000943374-22-000014.txt  
5                                      170 Valley LLC         D  1903913   20220112  edgar/data/1903913/0001903913-22-000001.txt  
6                         1st FRANKLIN FINANCIAL CORP     424B3    38723   20220112    edgar/data/38723/0000038723-22-000003.txt  
7                         1st FRANKLIN FINANCIAL CORP     424B3    38723   20220112    edgar/data/38723/0000038723-22-000004.txt  
8                               215 BF Associates LLC         D  1904145   20220112  edgar/data/1904145/0001904145-22-000001.txt  
9                       2401 Midpoint Drive REIT, LLC         D  1903337   20220112  edgar/data/1903337/0001903337-22-000001.txt