I have a txt file that I read into a list of strings, where each item of the list is a data sample of 3 variables (A, B, C):
txt = [
'001   0198110',
'0020130198110',
'0030132198110',
]
A separate support dataframe that looks like this
df = pd.DataFrame(data=[[1,3,"A"],[4,3,"B"],[7,6,"C"]],columns=["Position","Lenght","Name"])
df
   Position  Lenght Name
0         1       3    A
1         4       3    B
2         7       6    C
indicates how I should read the txt: for example, variable A starts at position 1 of each record and its length is 3.
Taking as an example only the first record
'001   0198110'
we have that A=001
B=empty->NaN
C=0198110
What would be an efficient way (considering that the txt file is 50k lines and contains 600 variables) to create the expected parsed dataframe?
   A    B        C
0  1  NaN   198110
1  2   13   198110
2  3   13  2198110
I can use a for loop (sketched below), but perhaps there is a better way.
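A minimal version of that loop (assuming Position is 1-based and that an all-blank field should become a missing value):

import pandas as pd

records = []
for line in txt:
    row = {}
    for pos, length, name in zip(df.Position, df.Lenght, df.Name):
        field = line[pos - 1 : pos - 1 + length].strip()  # Position is 1-based
        row[name] = field or None                         # blank field -> missing
    records.append(row)
parsed = pd.DataFrame(records)

With 50k lines and 600 variables this means ~30 million slices in pure Python, which is why I'm hoping for something faster.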
CodePudding user response:
Try pd.read_fwf:
import pandas as pd
from io import StringIO

# note the four blanks standing in for B in the first record
txt = ["001    198110", "0020130198110", "0030132198110"]
df = pd.DataFrame(
    data=[[1, 3, "A"], [4, 4, "B"], [7, 6, "C"]],
    columns=["Position", "Lenght", "Name"],
)

# read_fwf splits each line into fixed-width fields
x = pd.read_fwf(
    StringIO("\n".join(txt)),
    widths=df.Lenght,
    header=None,
)
x.columns = df.Name.to_list()
print(x)
Prints:

   A      B       C
0  1    NaN  198110
1  2  130.0  198110
2  3  132.0  198110

(B comes out as float because the NaN in the first row forces the column to float64.)
Note: I changed the txt list/df dataframe:
- First value of txt to "001    198110" (four blanks where B would be)
- Lenght of B to 4 in df
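Since your df also carries start positions, a variant (a sketch, assuming Position is 1-based; read_fwf's colspecs are 0-based, half-open intervals) is to build colspecs instead of widths:

import pandas as pd
from io import StringIO

# Turn 1-based Position/Lenght into 0-based (start, end) pairs
colspecs = [(p - 1, p - 1 + l) for p, l in zip(df.Position, df.Lenght)]

x = pd.read_fwf(
    StringIO("\n".join(txt)),
    colspecs=colspecs,
    header=None,
    names=df.Name.to_list(),
)

Unlike widths, this stays correct even when the fields don't sit flush against each other, which is easy to get wrong with 600 variables.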
CodePudding user response:
It's hard to recommend optimizations without knowing what the data usually looks like. My first two approaches would be these:
- If there are a lot of similarities between records (like C in your sample), you could look into running a diff first, so you know the ranges you can copy without rechecking every line.
- If there is a lot of overlap between variables, you could sort them by starting position and extract several at the same time, which should also be easier on memory than random access (rough sketch below).
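A rough sketch of the second idea, assuming the same txt and df as in the question and 1-based positions; the spec is sorted once, then each line is sliced left to right:

import pandas as pd

# Sort the variable spec so fields are extracted in file order
spec = df.sort_values("Position")
bounds = [
    (name, pos - 1, pos - 1 + length)   # 0-based, half-open slice bounds
    for name, pos, length in zip(spec.Name, spec.Position, spec.Lenght)
]
parsed = pd.DataFrame(
    [{name: line[a:b].strip() or None for name, a, b in bounds} for line in txt]
)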