how to split an integer value from one column to two columns in text file using pandas or numpy (pyt-CodePudding

I have a text file which has a number of integer values like this.

  20180701  20180707  52  11   1   2   4   1   0   0  10   7   1   3   1   0   4   5   2
  20180708  20180714 266   8  19   3   2   9   7  25  20  17  12   9   9  27  34  54  11
  20180715  20180721 654  52  34  31  20  16  12  25  84  31  38  37  38  69  66  87  14
  20180722  201807281017 110  72  46  52  29  29  22 204  41  46  51  57  67  82  92  17
  20180729  201808041106 276  37  11  87  20  10   8 284  54  54  72  38  49  41  53  12
  20180805  20180811 624  78  19  15  55  16   8   9 172  15  31  35  38  47  29  36  21
  20180812  20180818 488  63  17   7  26  10   9   7 116  17  14  39  31  34  27  64   7
  20180819  20180825  91   4   7   0   4   5   1   3  16   3   4   5  10  10   7  11   1
  20180826  20180901  49   2   2   1   0   4   0   1   2   0   1   4   8   2   6   6  10

I have to make a file by merging several files like this but you guys can see a problem with this data. In 4 and 5 lines, the first values, 1017 and 1106, right next to period index make a problem.

When I try to read these two lines, I always have had this result. It came out that first values in first column next to index columns couldn't recognized as first values themselves.

In [14]: fw.iloc[80,:]
Out[14]: 
3      72.0
4      46.0
5      52.0
6      29.0
7      29.0
8      22.0
9     204.0
10     41.0
11     46.0
12     51.0
13     57.0
14     67.0
15     82.0
16     92.0
17     17.0
18      NaN
Name: (20180722, 201807281017), dtype: float64

I tried to make it correct with indexing but failed. The desirable result is,

In [14]: fw.iloc[80,:]
Out[14]: 
2    1017.0
3     110.0
4      72.0
5      46.0
6      52.0
7      29.0
8      29.0
9      22.0
10    204.0
11     41.0
12     46.0
13     51.0
14     57.0
15     67.0
16     82.0
17     92.0
18     17.0
Name: (20180722, 201807281017), dtype: float64

How can I solve this problem?

I used this code to read this file.

fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)

CodePudding user response：

One possibility would be to manually construct the dataframe, this way we can parse the text by splitting the values every 4 characters.

from textwrap import wrap

import pandas as pd


def read_file(f_name):
    data = []
    with open(f_name) as f:
        for line in f.readlines():
            idx1 = line[0:8]
            idx2 = line[10:18]
            points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
            data.append([idx1, idx2, *points])
    return pd.DataFrame(data).set_index([0, 1])

CodePudding user response：

It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.

fw = pd.read_csv('test.txt',  header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
    num_str = str(fw.iat[i,1])
    a,b = map(int,[num_str[:-4],num_str[-4:]])
    fw.iloc[i,3:] = fw.iloc[i,2:-1]
    fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])

The result of print(fw) from there is

                     2    3   4   5   6   7   8   9    10  11  12  13  14  15  \
0        1                                                                      
20180701 20180707    52   11   1   2   4   1   0   0   10   7   1   3   1   0   
20180708 20180714   266    8  19   3   2   9   7  25   20  17  12   9   9  27   
20180715 20180721   654   52  34  31  20  16  12  25   84  31  38  37  38  69   
20180722 20180728  1017  110  72  46  52  29  29  22  204  41  46  51  57  67   
20180729 20180804  1106  276  37  11  87  20  10   8  284  54  54  72  38  49   
20180805 20180811   624   78  19  15  55  16   8   9  172  15  31  35  38  47   
20180812 20180818   488   63  17   7  26  10   9   7  116  17  14  39  31  34   
20180819 20180825    91    4   7   0   4   5   1   3   16   3   4   5  10  10   
20180826 20180901    49    2   2   1   0   4   0   1    2   0   1   4   8   2   

                   16  17    18  
0        1                       
20180701 20180707   4   5   2.0  
20180708 20180714  34  54  11.0  
20180715 20180721  66  87  14.0  
20180722 20180728  82  92  17.0  
20180729 20180804  41  53  12.0  
20180805 20180811  29  36  21.0  
20180812 20180818  27  64   7.0  
20180819 20180825   7  11   1.0  
20180826 20180901   6   6  10.0

Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True) for comparison.

                        2   3   4   5   6   7   8    9    10  11  12  13  14  \
0        1                                                                     
20180701 20180707       52  11   1   2   4   1   0    0   10   7   1   3   1   
20180708 20180714      266   8  19   3   2   9   7   25   20  17  12   9   9   
20180715 20180721      654  52  34  31  20  16  12   25   84  31  38  37  38   
20180722 201807281017  110  72  46  52  29  29  22  204   41  46  51  57  67   
20180729 201808041106  276  37  11  87  20  10   8  284   54  54  72  38  49   
20180805 20180811      624  78  19  15  55  16   8    9  172  15  31  35  38   
20180812 20180818      488  63  17   7  26  10   9    7  116  17  14  39  31   
20180819 20180825       91   4   7   0   4   5   1    3   16   3   4   5  10   
20180826 20180901       49   2   2   1   0   4   0    1    2   0   1   4   8   

                       15  16  17    18  
0        1                               
20180701 20180707       0   4   5   2.0  
20180708 20180714      27  34  54  11.0  
20180715 20180721      69  66  87  14.0  
20180722 201807281017  82  92  17   NaN  
20180729 201808041106  41  53  12   NaN  
20180805 20180811      47  29  36  21.0  
20180812 20180818      34  27  64   7.0  
20180819 20180825      10   7  11   1.0  
20180826 20180901       2   6   6  10.0

CodePudding user response：

A better fit for this would be pandas.read_fwf. For your example:

df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10] 17*[4])

I don't know if the column widths can be inferred for all your data or need to be hardcoded.