Elaborate and structure a csv file from a plain text file-CodePudding

I wonder if there is a way to convert a plain file as shown below

target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
target: locus119_window193
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -32.8 kcal/mol
p-value: 0.00028

To a csv type format delimited with comma

target                length   miRNA           length   mfe      p-value
locus9_window12       120      hsa-miR-4458     19      -23.7    0.033901 
locus104_window172    120      hsa-let-7b-5p    22      -26.2    0.015466
locus119_window193    120      hsa-let-7b-5p    22      -32.8    0.00028

If it is possible to transform a plain text file into a comma delimited csv file, I would appreciate any support and contribution

CodePudding user response：

Here is a potential solution using a regex and pandas methods. I rewrote the second length data field to miRNA_length (assuming it is the length of the miRNA) to avoid duplicated column names.

with open('filename.txt') as f:
    t = f.read()

import re

df = (pd.DataFrame(re.findall(r'([^\s:] )\s*: (\S*)', t), columns=['col', 'value'])
        # rename the length field that follows miRNA
        .assign(col=lambda d: d['col'].mask(d['col'].shift().eq('miRNA'), 'miRNA_length'))
        # group the data by row
        .assign(index=lambda d: d.groupby('col').cumcount())
        # reshape to wide format
        .pivot(index='index', columns='col', values='value')
        .rename_axis(index=None, columns=None)
        # convert the data types (e.g numeric)
        .convert_dtypes()
      )

output:

  length    mfe          miRNA miRNA_length   p-value              target
0    120  -23.7   hsa-miR-4458           19  0.033901     locus9_window12
1    120  -26.2  hsa-let-7b-5p           22  0.015466  locus104_window172
2    120  -32.8  hsa-let-7b-5p           22   0.00028  locus119_window193

input text if not file:

t = '''target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
target: locus119_window193
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -32.8 kcal/mol
p-value: 0.00028
'''

Saving as CSV:

df.to_csv('out.csv') # check the doc for more options

CodePudding user response：

miller is quite good for this kind of file formatting. The input file needs a little bit of tweaking, to add a blank line between each record, and to remove the colon

awk -F: 'NR > 1 && $1 == "target" {print ""}; {sub(/:/,""); print}' file \
| mlr --ixtab --ocsv cat

outputs

target,length,miRNA,mfe,p-value
locus9_window12,19,hsa-miR-4458,-23.7 kcal/mol,0.033901
locus104_window172,22,hsa-let-7b-5p,-26.2 kcal/mol,0.015466
locus119_window193,22,hsa-let-7b-5p,-32.8 kcal/mol,0.00028

CodePudding user response：

Assuming there are no commas in your text file and no spaces in the fields you want to print, GNU awk can help:

# foo.awk
BEGIN {
  print "length,mfe,miRNA,miRNA_length,p-value,target"
}
{
  fields[NR%6] = $NF
}
NR%6 == 0 {
  for(i=1; i<=6; i  ) printf("%s%c", fields[i%6], i==6 ? "\n" : OFS)
}

And then:

awk -v OFS=, -f foo.awk foo.txt
length,mfe,miRNA,miRNA_length,p-value,target
locus9_window12,120,hsa-miR-4458,19,kcal/mol,0.033901
locus104_window172,120,hsa-let-7b-5p,22,kcal/mol,0.015466
locus119_window193,120,hsa-let-7b-5p,22,kcal/mol,0.00028

Explanations: we populate the fields array with the last field of each line ($NF) at index "line number modulo 6" (NR%6). Note that line numbers starts at 1, so the last of each group of 6 is indexed 0 in the array, not 6. If the current record number is a multiple of 6 we print the content of the fields array. The output field separator is set to comma (-v OFS=,).