Use list items as column seperators pd.read

I have text files containing tables which I want to put into a dataframe. Per file the column headers are the same, but the width is different depending on the content (because they contain names of different lengths for example).

So far I managed to get the index of the first character of the header, so I know where the column separation should be. As all files (about 100) have different widths I want to be able to insert this list in order to create a dataframe with the correct column width.

This is what the first two rows look like:

!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name

What I tried so far:

(1)

pandas pd.read_fwf('file.txt', colspec(()) - This does work, but with colspec I have to put in the (from, start) indexes for each column. Not only would this be burdensome manually, but some files have 12 columns while others have 10.

(2)

pandas pd.read_fwf('file.txt', widhts(list)) - Here I can easily insert the list with locations of the column separations, but it does not seem to create a separation at those indexes. I do not exactly understand what is does.

Question:

I currently have a list of indexes of all the exclamation marks:

list = [0, 17, 30, 45, 58]

How can I use this list and separate the columns to convert the .txt file into a DataFrame?

Any other way to solve this issue is more than welcome!

CodePudding user response：

You can use zip() to convert list [0, 17-1, 30-1, 45-1] to pairs [(0, 16), (16, 29), (29, 44), (44, 57)]

cols = [0, 17-1, 30-1, 45-1]

#pairs = [(a, b) for a, b in zip(cols, cols[1:])]
pairs = list(zip(cols, cols[1:]))

df = pd.read_fwf('file.txt', pairs)

Minimale working code.

I use io to simulate file in memory - so everyone can simply copy and test it - but you should use filename

text = '''!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name'''

import pandas as pd
import io

#cols = [0, 17-1, 30-1, 45-1]

# --- search all `!` in first line ----

#with open('file.txt') as fh:
with io.StringIO(text) as fh:
   first = next(fh)

start = 0
cols = []
while True:
    pos = first.find('!', start)
    if pos < 0:
        break
    cols.append(pos)
    start = pos 1
print('cols:', cols)    
cols.append(1000)  # add some big value as the end of last column

# --- convert to pairs ---

#pairs = [(a, b) for a, b in zip(cols, cols[1:])]
pairs = list(zip(cols, cols[1:]))
print('pairs:', pairs)

# --- load file ---

#df = pd.read_fwf('file.txt', pairs)
df = pd.read_fwf(io.StringIO(text), pairs)

print(df)

Result:

         !Column1    !Column2 !Column3      !Column4 !Column5
0  Company with a  $1,000,000      Yes    Jack, Hank        X
1            name         NaN      NaN           NaN      NaN
2   Company with.      $2,000       No  Rita, Hannah        X
3    another name         NaN      NaN           NaN      NaN

But this shows other problem - first column has values in two rows but pandas treats it as separated rows.

CodePudding user response：

So what you can do is standardize the spacing with regex.

import re
string = "something    something  something  more"
results = re.sub("(\W )", "|", string)
results

That returns

'something|something|something|more'

If you have standardized the delimiters, you can load it with fwf or just read_csv.

EDIT

In order to derive the span of the header that is delimited with a exclamation mark ! you can use the re library too. The logic of the pattern is that the sequence has to start with ! and then is followed up by many non-!. The next group would inherently start with a !. In code it would look something like this:

example_txt = """!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name"""

first_line = example_txt.split("\n")[0]

import re 

indexes = []
p = re.compile("![^!]*")
for m in p.finditer(first_line):
    indexes.append(m.span())

print(indexes)

Which returns

[(0, 16), (16, 29), (29, 44), (44, 57), (57, 83)]

This should bring you close to what you need for fwf method of pandas. Not that indexing in python starts at 0 and that if the end-index doesn't count. So if you index from [0:16] then you would get the 0th to 15th element (not including the 16th element), returning 16 elements in total. The index can therefore be directly applied.