Slice each string row of pandas dataframe according a pattern array-CodePudding

I need to slice a string row of pandas in different positions, and I want to use vectorization for that. Some one can help me? Each line has this pattern:

012016010402AAPL34      010APPLE       DRN          R$  000000000415000000000042200000000004150000000000421300000000042080000000003950000000000435000005000000000000012500000000000052664400000000000000009999123100000010000000000000BRAAPLBDR004115

This line has 26 diferent information separated by chars, for example:

['01', '2016/01/04', '02', 'AAPL34      ', ...,'115']

The string position os each data is define by this array: [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]

I tried unsuccessfully to use this function with the dataframe:

def row_slice(s,indices):
    return pd.Series([s[i:j] for i,j in zip(indices, indices[1:] [None])])

The data that I'm using can be downloaded by this link:

Someone can help me?

CodePudding user response：

It looks like you need pandas.read_fwf, reading your file directly:

l = [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]

import numpy as np
df = pd.read_fwf('filename', widths=np.diff(l), header=None)

Output:

   0         1   2       3   4      5    6   7   8     9   ...  16     17  \
0   1  20160104   2  AAPL34  10  APPLE  DRN NaN  R$  4150  ...   5  12500   

         18             19  20        21       22             23  \
0  52664400  000000000000d   i  fferent3  1000000  1000000000000   

             24   25  
0  0BRAAPLBDR00  411  

[1 rows x 26 columns]

With leading zeros (as string), add the dtype=str parameter:

   0         1   2       3    4      5    6    7   8              9   ...  \
0  01  20160104  02  AAPL34  010  APPLE  DRN  NaN  R$  0000000004150  ...   

      16                  17                  18             19 20        21  \
0  00005  000000000000012500  000000000052664400  000000000000d  i  fferent3   

        22             23            24   25  
0  1000000  1000000000000  0BRAAPLBDR00  411  

[1 rows x 26 columns]

CodePudding user response：

Use something similar to this:

df['col name'].str.extractall('(\d{2})(\d{8})(\d{2})([A-Z]{4}\d{2})')

Output:

0   1   2   3
match               
0   0   01  20160104    02  AAPL34

CodePudding user response：

Here is a way to split the string into a list of substrings according to indices:

lst = [s[indices[i]:indices[i 1]].strip() for i in range(len(indices) - 1)]

Output:

['01', '20160104', '02', 'AAPL34', '010', 'APPLE', 'DRN', '', 'R$', '0000000004150', '0000000004220', '0000000004150', '0000000004213', '0000000004208', '0000000003950', '0000000004350', '00005', '000000000000012500', '000000000052664400', '000000000000d', 'i', 'fferent3', '1000000', '1000000000000', '0BRAAPLBDR00', '411']