I have a text file that needs to be converted into a CSV file using pandas. A piece of it is shown below:
time 00:15 min
cod,10,1=0,2=2,3=2,4=1,5=6,6=4,7=2,8=7,9=1,10=9,11=7
cod,18,1=27,2=18,3=19,4=20,5=47,6=2,7=2,8=0,9=33,10=61,11=13,12=2,13=3,14=0,15=0
The rows should be cod,10 and cod,18, and the columns should be 1, 2, 3, ..., 15. Any ideas? Regards, Ali
CodePudding user response:
I use pandas for the conversion, but plain Python for some aspects of parsing the data; I hope that is alright.
One issue we need to deal with is that the rows have different numbers of columns, so I put NaN in the columns that are missing for a row. For instance, row 1 is shorter than row 2, so the columns missing from row 1 are given the value "NaN".
Here is my idea:
import pandas as pd

lines = []
with open('/path/to/test.txt', 'r') as infile:
    for line in infile:
        # Skip lines without commas, such as the "time 00:15 min" header
        if "," not in line:
            continue
        lines.append(line.strip().split(","))

row_names = []
column_data = {}

# Pad shorter rows with "<column>=NaN" entries so every row has the same length.
# This assumes the columns are numbered consecutively starting from 1.
max_length = max(len(line) for line in lines)
for line in lines:
    while len(line) < max_length:
        line.append(f'{len(line)-1}=NaN')

for line in lines:
    # The first two fields (e.g. "cod" and "10") form the row label
    row_names.append(" ".join(line[:2]))
    for info in line[2:]:
        k, v = info.split("=")
        if k in column_data:
            column_data[k].append(v)
        else:
            column_data[k] = [v]

df = pd.DataFrame(column_data)
df.index = row_names
print(df)
df.to_csv('/path/to/test.csv')
Output (the printed DataFrame):
         1   2   3   4   5  6  7  8   9  10  11   12   13   14   15
cod 10   0   2   2   1   6  4  2  7   1   9   7  NaN  NaN  NaN  NaN
cod 18  27  18  19  20  47  2  2  0  33  61  13    2    3    0    0
CSV File Output:
,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod 10,0,2,2,1,6,4,2,7,1,9,7,NaN,NaN,NaN,NaN
cod 18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0
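As a variation on the padding step above, here is a minimal sketch that lets pandas fill in the missing columns itself: when a DataFrame is built from a list of dictionaries, keys that are absent from a row become NaN. The helper name parse_rows and the inline sample string are assumptions made for illustration; a real script would read the lines from the file instead.

```python
import io
import pandas as pd

# Sample input copied from the question; replace with open('/path/to/test.txt')
# in a real script.
sample = """time 00:15 min
cod,10,1=0,2=2,3=2,4=1,5=6,6=4,7=2,8=7,9=1,10=9,11=7
cod,18,1=27,2=18,3=19,4=20,5=47,6=2,7=2,8=0,9=33,10=61,11=13,12=2,13=3,14=0,15=0
"""

def parse_rows(text_lines):
    """Hypothetical helper: return (row labels, list of {column: value} dicts)."""
    labels, records = [], []
    for line in text_lines:
        line = line.strip()
        if "," not in line:  # skips the "time ..." line and blank lines
            continue
        cod, num, *pairs = line.split(",")
        labels.append(f"{cod} {num}")
        # Each "k=v" pair becomes one dictionary entry
        records.append(dict(pair.split("=") for pair in pairs))
    return labels, records

labels, records = parse_rows(io.StringIO(sample))

# Columns missing from a record are filled with NaN automatically
df = pd.DataFrame(records, index=labels)

# na_rep controls how missing values are written; the default is an empty field
csv_text = df.to_csv(na_rep="NaN")
print(csv_text)
```

With this approach no manual padding is needed, and the column numbers do not have to be consecutive.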
CodePudding user response:
You can use Python's standard csv module and its DictWriter class to handle the variability in the column names:
import csv

headers = {'Cod': None, 'Num': None}  # other column names will be added
all_rows = []

with open('input.txt') as f:
    next(f)  # discard first "time" line

    # Iterate text lines
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which cannot be unpacked below
        cod, num, *pairs = line.split(',')
        row = {'Cod': cod, 'Num': num}  # create new row, and fill it in...
        for pair in pairs:
            pair = pair.strip()
            col_name, value = pair.split('=')
            row[col_name] = value
            # Make sure headers has this column's name
            headers[col_name] = None
        all_rows.append(row)

with open('output.csv', 'w', newline='') as f:
    # Rows that lack a column are written with an empty field for it
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(all_rows)
This gives me the following output:
Cod,Num,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod,10,0,2,2,1,6,4,2,7,1,9,7,,,,
cod,18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0
| Cod | Num | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|-----|-----|----|----|----|----|----|---|---|---|----|----|----|----|----|----|----|
| cod | 10 | 0 | 2 | 2 | 1 | 6 | 4 | 2 | 7 | 1 | 9 | 7 | | | | |
| cod | 18 | 27 | 18 | 19 | 20 | 47 | 2 | 2 | 0 | 33 | 61 | 13 | 2 | 3 | 0 | 0 |
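The empty fields in the shorter row come from DictWriter's default filler for missing keys. If an explicit marker such as NaN is preferred, the restval parameter of DictWriter can supply it. The following is a small sketch with made-up, shortened rows written to an in-memory buffer rather than to output.csv:

```python
import csv
import io

# Shortened example rows (illustrative data, not the full input)
rows = [
    {"Cod": "cod", "Num": "10", "1": "0", "2": "2"},
    {"Cod": "cod", "Num": "18", "1": "27", "2": "18", "3": "19"},
]
fieldnames = ["Cod", "Num", "1", "2", "3"]

buffer = io.StringIO()
# restval is written for any field name that a row dictionary lacks
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="NaN")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Here the first row has no value for column 3, so that field is written as NaN instead of being left empty.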
If you need cod and num joined into a single column, say, id, make these changes:
headers = {'id': None}  # <<< changed line
...
        cod, num, *pairs = line.split(',')
        id_ = cod + ' ' + num  # <<< new line
        row = {'id': id_}  # <<< changed line
        for pair in pairs:
and now I get:
id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod 10,0,2,2,1,6,4,2,7,1,9,7,,,,
cod 18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0
| id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|--------|----|----|----|----|----|---|---|---|----|----|----|----|----|----|----|
| cod 10 | 0 | 2 | 2 | 1 | 6 | 4 | 2 | 7 | 1 | 9 | 7 | | | | |
| cod 18 | 27 | 18 | 19 | 20 | 47 | 2 | 2 | 0 | 33 | 61 | 13 | 2 | 3 | 0 | 0 |
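Since the question mentions pandas, one way to check the result is to load the CSV back into a DataFrame. This is only a sketch: it uses a shortened, made-up CSV string in place of output.csv, and it assumes the id column should serve as the index.

```python
import io
import pandas as pd

# Shortened stand-in for the contents of output.csv (illustrative data)
csv_text = "id,1,2,3\ncod 10,0,2,\ncod 18,27,18,19\n"

# Empty fields are read back as NaN; index_col turns the id column into row labels
df = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(df)
```

Columns that contain an empty field come back as floating-point columns, because NaN is a float value in pandas.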