I have a text file which has a number of integer values like this.
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to make a file by merging several files like this but you guys can see a problem with this data. In 4 and 5 lines, the first values, 1017 and 1106, right next to period index make a problem.
When I try to read these two lines, I always have had this result. It came out that first values in first column next to index columns couldn't recognized as first values themselves.
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to make it correct with indexing but failed. The desirable result is,
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem?
I used this code to read this file.
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
CodePudding user response:
One possibility would be to manually construct the dataframe, this way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd
def read_file(f_name):
data = []
with open(f_name) as f:
for line in f.readlines():
idx1 = line[0:8]
idx2 = line[10:18]
points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
data.append([idx1, idx2, *points])
return pd.DataFrame(data).set_index([0, 1])
CodePudding user response:
It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
num_str = str(fw.iat[i,1])
a,b = map(int,[num_str[:-4],num_str[-4:]])
fw.iloc[i,3:] = fw.iloc[i,2:-1]
fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])
The result of print(fw)
from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True)
for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0
CodePudding user response:
A better fit for this would be pandas.read_fwf
. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10] 17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.