Pandas read file with no delimiter and with different column widths

Time:07-02

I want to read a plaintext file using pandas. I have entries without delimiters and with different widths like this:

59967Y98Doe John            6211100004545SO20140314-  00024278
N0546664SCHMIDT-PETER       7441100008300AW20140314-  00023643
G4894jmhTAKLONSKY-JUERGEN   4211100005000TB20140315   00023882
34875738PODESBERG-SCHUMPERTS6211100003671SO20140315   00024622
  • 1-8 is a string.
  • 9-28 is a string.
  • 29-31 is numeric.
  • 32-34 is numeric.
  • 35-41 is numeric.
  • 42-43 is a string.
  • 44-51 is a date (yyyyMMdd).
  • 52 is minus or a blank
  • The rest is a currency amount without a decimal point (the last 2 digits are always after the decimal point). For example: - 00024278 = -242.78 €

I know there is pd.read_fwf, and it has a widths argument. I could do this:

pd.read_fwf(StringIO(txt), widths=[8], names=["Personal Nr."])

But how can I read my file with different column widths?

Answer:

As the s in widths suggests, you can pass a list of widths:

pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None)

output:

          0                     1    2    3     4   5         6    7      8
0  59967Y98              Doe John  621  110  4545  SO  20140314    -  24278
1  N0546664         SCHMIDT-PETER  744  110  8300  AW  20140314    -  23643
2  G4894jmh     TAKLONSKY-JUERGEN  421  110  5000  TB  20140315  NaN  23882
3  34875738  PODESBERG-SCHUMPERTS  621  110  3671  SO  20140315  NaN  24622
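If you would rather work from the character positions listed in the question than from widths, read_fwf also accepts colspecs, a list of half-open (start, end) offsets counted from 0. A minimal sketch, assuming every row is exactly 62 characters wide so that the last field ends at position 62; the positions list and the way txt is rebuilt here are only for illustration:

```python
import io
import pandas as pd

# Sample rows from the question. The name field is padded to 20 characters
# with ljust so the fixed-width layout is explicit.
txt = "\n".join([
    "59967Y98" + "Doe John".ljust(20) + "6211100004545SO20140314-  00024278",
    "N0546664" + "SCHMIDT-PETER".ljust(20) + "7441100008300AW20140314-  00023643",
    "G4894jmh" + "TAKLONSKY-JUERGEN".ljust(20) + "4211100005000TB20140315   00023882",
    "34875738" + "PODESBERG-SCHUMPERTS".ljust(20) + "6211100003671SO20140315   00024622",
])

# 1-based, inclusive positions as given in the question
# (the end of the last field, 62, is an assumption about the row length).
positions = [(1, 8), (9, 28), (29, 31), (32, 34), (35, 41),
             (42, 43), (44, 51), (52, 52), (53, 62)]

# colspecs wants 0-based, half-open intervals: [start - 1, end)
colspecs = [(start - 1, end) for start, end in positions]

df = pd.read_fwf(io.StringIO(txt), colspecs=colspecs, header=None)
print(df)
```

This should give the same frame as the widths version; which one to use mostly depends on whether your file spec is written as widths or as positions.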

If you want names and dtypes:

df = (pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None,
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                  # the parameter is dtype and takes a {column: type} mapping, not a list
                  dtype={'A': str, 'B': str, 'C': int, 'D': int, 'E': int,
                         'F': str, 'G': str, 'H': str, 'I': int})
        .assign(G=lambda d: pd.to_datetime(d['G'], format='%Y%m%d'))
     )

output:

          A                     B    C    D     E   F          G    H      I
0  59967Y98              Doe John  621  110  4545  SO 2014-03-14    -  24278
1  N0546664         SCHMIDT-PETER  744  110  8300  AW 2014-03-14    -  23643
2  G4894jmh     TAKLONSKY-JUERGEN  421  110  5000  TB 2014-03-15  NaN  23882
3  34875738  PODESBERG-SCHUMPERTS  621  110  3671  SO 2014-03-15  NaN  24622

df.dtypes
A            object
B            object
C             int64
D             int64
E             int64
F            object
G    datetime64[ns]
H            object
I             int64
dtype: object
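
The question also asks for a signed amount with two decimals, while the frame above still keeps the sign (H) and the digits (I) in separate columns. One possible way to combine them, as a sketch: the amount column name is made up here, and float is used only for illustration.

```python
import io
import numpy as np
import pandas as pd

# Sample rows from the question; the name field is padded to 20 characters.
txt = "\n".join([
    "59967Y98" + "Doe John".ljust(20) + "6211100004545SO20140314-  00024278",
    "N0546664" + "SCHMIDT-PETER".ljust(20) + "7441100008300AW20140314-  00023643",
    "G4894jmh" + "TAKLONSKY-JUERGEN".ljust(20) + "4211100005000TB20140315   00023882",
    "34875738" + "PODESBERG-SCHUMPERTS".ljust(20) + "6211100003671SO20140315   00024622",
])

df = pd.read_fwf(io.StringIO(txt), widths=[8, 20, 3, 3, 7, 2, 8, 1, 99],
                 header=None, names=list("ABCDEFGHI"))

# H is "-" for negative rows and NaN for blank ones; eq("-") is False for NaN.
sign = np.where(df["H"].eq("-"), -1, 1)

# I holds the amount in cents, so divide by 100 to get euros.
df["amount"] = sign * df["I"] / 100
print(df[["H", "I", "amount"]])
```

If exact cents matter downstream, it may be safer to keep the signed integer (sign * df["I"]) and only format it as a decimal for display.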