I need to slice a string row of pandas in different positions, and I want to use vectorization for that. Some one can help me? Each line has this pattern:
012016010402AAPL34 010APPLE DRN R$ 000000000415000000000042200000000004150000000000421300000000042080000000003950000000000435000005000000000000012500000000000052664400000000000000009999123100000010000000000000BRAAPLBDR004115
This line has 26 diferent information separated by chars, for example:
['01', '2016/01/04', '02', 'AAPL34 ', ...,'115']
The string position os each data is define by this array: [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]
I tried unsuccessfully to use this function with the dataframe:
def row_slice(s,indices):
return pd.Series([s[i:j] for i,j in zip(indices, indices[1:] [None])])
The data that I'm using can be downloaded by this link:
Someone can help me?
CodePudding user response:
It looks like you need pandas.read_fwf
, reading your file directly:
l = [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]
import numpy as np
df = pd.read_fwf('filename', widths=np.diff(l), header=None)
Output:
0 1 2 3 4 5 6 7 8 9 ... 16 17 \
0 1 20160104 2 AAPL34 10 APPLE DRN NaN R$ 4150 ... 5 12500
18 19 20 21 22 23 \
0 52664400 000000000000d i fferent3 1000000 1000000000000
24 25
0 0BRAAPLBDR00 411
[1 rows x 26 columns]
With leading zeros (as string), add the dtype=str
parameter:
0 1 2 3 4 5 6 7 8 9 ... \
0 01 20160104 02 AAPL34 010 APPLE DRN NaN R$ 0000000004150 ...
16 17 18 19 20 21 \
0 00005 000000000000012500 000000000052664400 000000000000d i fferent3
22 23 24 25
0 1000000 1000000000000 0BRAAPLBDR00 411
[1 rows x 26 columns]
CodePudding user response:
Use something similar to this:
df['col name'].str.extractall('(\d{2})(\d{8})(\d{2})([A-Z]{4}\d{2})')
Output:
0 1 2 3
match
0 0 01 20160104 02 AAPL34
CodePudding user response:
Here is a way to split the string into a list of substrings according to indices
:
lst = [s[indices[i]:indices[i 1]].strip() for i in range(len(indices) - 1)]
Output:
['01', '20160104', '02', 'AAPL34', '010', 'APPLE', 'DRN', '', 'R$', '0000000004150', '0000000004220', '0000000004150', '0000000004213', '0000000004208', '0000000003950', '0000000004350', '00005', '000000000000012500', '000000000052664400', '000000000000d', 'i', 'fferent3', '1000000', '1000000000000', '0BRAAPLBDR00', '411']