I wonder if there is a way to convert a plain file as shown below
target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
target: locus119_window193
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -32.8 kcal/mol
p-value: 0.00028
To a csv type format delimited with comma
target length miRNA length mfe p-value
locus9_window12 120 hsa-miR-4458 19 -23.7 0.033901
locus104_window172 120 hsa-let-7b-5p 22 -26.2 0.015466
locus119_window193 120 hsa-let-7b-5p 22 -32.8 0.00028
If it is possible to transform a plain text file into a comma delimited csv file, I would appreciate any support and contribution
CodePudding user response:
Here is a potential solution using a regex and pandas methods. I rewrote the second length
data field to miRNA_length
(assuming it is the length of the miRNA) to avoid duplicated column names.
with open('filename.txt') as f:
t = f.read()
import re
df = (pd.DataFrame(re.findall(r'([^\s:] )\s*: (\S*)', t), columns=['col', 'value'])
# rename the length field that follows miRNA
.assign(col=lambda d: d['col'].mask(d['col'].shift().eq('miRNA'), 'miRNA_length'))
# group the data by row
.assign(index=lambda d: d.groupby('col').cumcount())
# reshape to wide format
.pivot(index='index', columns='col', values='value')
.rename_axis(index=None, columns=None)
# convert the data types (e.g numeric)
.convert_dtypes()
)
output:
length mfe miRNA miRNA_length p-value target
0 120 -23.7 hsa-miR-4458 19 0.033901 locus9_window12
1 120 -26.2 hsa-let-7b-5p 22 0.015466 locus104_window172
2 120 -32.8 hsa-let-7b-5p 22 0.00028 locus119_window193
input text if not file:
t = '''target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
target: locus119_window193
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -32.8 kcal/mol
p-value: 0.00028
'''
Saving as CSV:
df.to_csv('out.csv') # check the doc for more options
CodePudding user response:
miller is quite good for this kind of file formatting. The input file needs a little bit of tweaking, to add a blank line between each record, and to remove the colon
awk -F: 'NR > 1 && $1 == "target" {print ""}; {sub(/:/,""); print}' file \
| mlr --ixtab --ocsv cat
outputs
target,length,miRNA,mfe,p-value
locus9_window12,19,hsa-miR-4458,-23.7 kcal/mol,0.033901
locus104_window172,22,hsa-let-7b-5p,-26.2 kcal/mol,0.015466
locus119_window193,22,hsa-let-7b-5p,-32.8 kcal/mol,0.00028
CodePudding user response:
Assuming there are no commas in your text file and no spaces in the fields you want to print, GNU awk can help:
# foo.awk
BEGIN {
print "length,mfe,miRNA,miRNA_length,p-value,target"
}
{
fields[NR%6] = $NF
}
NR%6 == 0 {
for(i=1; i<=6; i ) printf("%s%c", fields[i%6], i==6 ? "\n" : OFS)
}
And then:
awk -v OFS=, -f foo.awk foo.txt
length,mfe,miRNA,miRNA_length,p-value,target
locus9_window12,120,hsa-miR-4458,19,kcal/mol,0.033901
locus104_window172,120,hsa-let-7b-5p,22,kcal/mol,0.015466
locus119_window193,120,hsa-let-7b-5p,22,kcal/mol,0.00028
Explanations: we populate the fields
array with the last field of each line ($NF
) at index "line number modulo 6" (NR%6
). Note that line numbers starts at 1, so the last of each group of 6 is indexed 0 in the array, not 6. If the current record number is a multiple of 6 we print the content of the fields
array. The output field separator is set to comma (-v OFS=,
).