How can I read and parse files with a variable number of spaces as the delimiter?

Time:05-12

I need help solving this problem:

I have a directory full of .txt files that look like this:

file1.no
file2.no
file3.no

Every file has the following structure (I only care about the first two "columns" in the .txt):


#POS SEQ  SCORE    QQ-INTERVAL     STD      MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is: 
#LL=-4797.62
    1     M  0.3821   [0.01331,0.5465]  0.4421    7/7
    2     E  0.4508   [0.05393,0.6788]  0.5331    7/7
    3     L  0.5334   [0.05393,0.6788]  0.6279    7/7
    4     G  0.5339   [0.05393,0.6788]   0.624    7/7

And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).

But now I am running into two issues:

  1. How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside each file.

  2. How to parse these files, given that the spacing between the columns varies for almost all of them.

My output should look like this:

|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1,2345,0.0241,0.5901|
|File3|LLKM|0.9812,0,2145,0.4142,0.4921|

So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.

Any help is welcome, thanks in advance.

Answer:

Here is some example code that should work for you:

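using DataFrames

# Parse one file into a single row: file name, joined SEQ letters, vector of scores.
function parsefile(filename)
    lines = readlines(filename)
    # Drop the '#' header lines and any blank lines.
    filter!(x -> !startswith(x, "#") && !isempty(strip(x)), lines)
    # split without a delimiter splits on any run of whitespace,
    # so the varying spacing between columns does not matter.
    fields = split.(lines)
    return (File=filename,
            SEQ=join(getindex.(fields, 2)),
            SCORE=parse.(Float64, getindex.(fields, 3)))
end

df = DataFrame()
# Append one row per file.
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])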

The result will be in the df data frame.
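If the file names are not known in advance, the same idea can be applied to every file in a directory. The following is only a minimal sketch using Base Julia: the helper names parse_scores and parse_dir, the ".no" extension filter, and the choice to store SCORE as a comma-separated string (to match the table in the question) are assumptions for illustration, not part of the original answer.

```julia
# Hypothetical helper: parse one file into a NamedTuple row.
function parse_scores(path)
    # Keep only data lines; split() with no delimiter splits on any run of whitespace.
    rows = [split(l) for l in eachline(path)
            if !isempty(strip(l)) && !startswith(lstrip(l), "#")]
    return (File = basename(path),
            SEQ = join(r[2] for r in rows),
            SCORE = join((r[3] for r in rows), ","))
end

# Hypothetical helper: parse every file in `dir` whose name ends with `ext`.
function parse_dir(dir; ext = ".no")
    paths = filter(p -> endswith(p, ext), readdir(dir; join = true))
    return [parse_scores(p) for p in paths]
end

# Small demonstration on a temporary directory holding one sample file.
rows = mktempdir() do dir
    write(joinpath(dir, "file1.no"), """
    #POS SEQ  SCORE    QQ-INTERVAL     STD      MSA DATA
    #LL=-4797.62
        1     M  0.3821   [0.01331,0.5465]  0.4421    7/7
        2     E  0.4508   [0.05393,0.6788]  0.5331    7/7
    """)
    parse_dir(dir)
end
```

The result is a vector of NamedTuples, which can be passed to DataFrame(rows) from DataFrames.jl to get one row per file. If numeric scores are needed instead of a string, parse.(Float64, ...) can be used as in the answer above.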
