read position based txt-CodePudding

I have a txt file that I read into a list of strings in which each item of the list is a data sample of 3 variables (A,B,C)

txt = [
     '001   0198110',
     '0020130198110',
     '0030132198110',
]

A separate support dataframe that looks like this

df = pd.DataFrame(data=[[1,3,"A"],[4,3,"B"],[7,6,"C"]],columns=["Position","Lenght","Name"])

df
       Position   Lenght    Name
    0   1         3         A
    1   4         3         B
    2   7         6         C

indicates how I should read the txt. For example the variable A starts at position 1 of each record and its length is 3.

Taking as an example only the first record

'001   0198110'

we have that A=001 B=empty->NaN C=0198110

What would be an efficient way (considering that the txt file is 50k lines and contains 600 variables) to create the expected parsed dataframe

    A   B       C
0   1   NaN     198110
1   2   13      198110
2   3   13     2198110

I can use a for loop, but perhaps there is a better way

CodePudding user response：

Try pd.read_fwf:

from io import StringIO

txt = ["001    198110", "0020130198110", "0030132198110"]

df = pd.DataFrame(
    data=[[1, 3, "A"], [4, 4, "B"], [7, 6, "C"]],
    columns=["Position", "Lenght", "Name"],
)


x = pd.read_fwf(
    StringIO("\n".join(txt)),
    widths=df.Lenght,
    header=None,
)
x.columns = df.Name.to_list()
print(x)

Prints:

   A      B       C
0  1    NaN  198110
1  2  130.0  198110
2  3  132.0  198110

Note: I changes the txt list/df dataframe:

First value of txt to "001 198110"
Length of B to 4 in df

Prints:

CodePudding user response：

It's hard to recommend optimizations without knowing what the data usually looks like. My first two approaches would be those:

If there are a lot of similarities like C, you could look into running a diff first, so you know the ranges where you can copy without rechecking over all the files.

If there is a lot of overlap between variables, you could sort them by starting position and write multiple at the same time, which should also be easier on memory than random accesses