How to slice strings in dataframe based on string length of column in Python?

Time:04-14

The problem I want to solve: take len() of each value in one column and use that number of characters, row by row, to slice the value in another column.

I have a dataframe with general ledger codes of varying length, and I need to find the lowest level of detail to prevent double counting. I can find it by comparing the digits of the current row with those of the next row, using the number of characters in the current row. For example, 11.0 and 111.0 are group accounts covering 1111-1123. I only want 1111-1123 and want to exclude the group accounts.

I can use len() to get the number of characters of the current row, but I am not able to apply it across the entire column.

My dataframe looks like this:

df3

    Account    Amount 
0      11.0   1000.82         
1     111.0   1000.42         
2    1111.0    791.51       
3    1115.0   1802.19       
4    1116.0    202.36      
5    1117.0   1507.33      
6    1118.0      0.03      
7    1119.0      0.00      
8    1120.0      0.00      
9    1121.0     24.28      
10   1122.0    376.87      
11   1123.0      0.25      
14     12.0  80179.92      
15    121.0  80179.92      
16  12101.0      0.00      

I tried to calculate this by adding one column that holds the next row's account and another that holds the number of characters in the current row's account:

df3['Next_Account'] = df3['Account'].shift(-1)
df3['Len_account'] = df3['Account'].str.len()-2

    Account    Amount Next_Account  Len_account  
0      11.0   1000.82        111.0            2   
1     111.0   1000.42       1111.0            3      
2    1111.0    791.51       1115.0            4     
3    1115.0   1802.19       1116.0            4      
4    1116.0    202.36       1117.0            4      
5    1117.0   1507.33       1118.0            4      
6    1118.0      0.03       1119.0            4       
7    1119.0      0.00       1120.0            4        
8    1120.0      0.00       1121.0            4        
9    1121.0     24.28       1122.0            4        
10   1122.0    376.87       1123.0            4       
11   1123.0      0.25         12.0            4          
14     12.0  80179.92        121.0            2        
15    121.0  80179.92      12101.0            3        
16  12101.0      0.00      12102.0            5        
      

I then tried to take that many leading characters of the next account with the .str accessor, but this returns only NaN:

df3['current_digits_next'] = df3['Next_Account'].str[:df3['Len_account']]
df3

    current_digits_next  
0                   NaN  
1                   NaN  
2                   NaN  
3                   NaN  
4                   NaN  
5                   NaN  
6                   NaN  
7                   NaN  
8                   NaN  
9                   NaN  
10                  NaN  
11                  NaN  
14                  NaN  
15                  NaN  
16                  NaN  

The preferred output is:

    current_digits_next  
0                   11  
1                   111  
2                   1115  
3                   1116 
4                   1117 
5                   1118 
6                   1119 
7                   1120 
8                   1121 
9                   1122 
10                  1123  
11                  12.0  
14                  12  
15                  121  
16                  12102  

With the preferred output I can match the data and exclude the group accounts. What am I doing wrong?

CodePudding user response:

Slicing with the .str accessor accepts integer bounds, not a Series, so it cannot use a different length for each row. You can apply the slice row by row instead:

df3['current_digits_next'] = df3.apply(lambda row: str(row['Next_Account'])[:row['Len_account']], axis=1)
    Account    Amount Next_Account  Len_account current_digits_next
0      11.0   1000.82        111.0            2                  11
1     111.0   1000.42       1111.0            3                 111
2    1111.0    791.51       1115.0            4                1115
3    1115.0   1802.19       1116.0            4                1116
4    1116.0    202.36       1117.0            4                1117
5    1117.0   1507.33       1118.0            4                1118
6    1118.0      0.03       1119.0            4                1119
7    1119.0      0.00       1120.0            4                1120
8    1120.0      0.00       1121.0            4                1121
9    1121.0     24.28       1122.0            4                1122
10   1122.0    376.87       1123.0            4                1123
11   1123.0      0.25         12.0            4                12.0
12     12.0  80179.92        121.0            2                  12
13    121.0  80179.92      12101.0            3                 121
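
If apply over rows turns out to be slow on a large ledger, the same per-row slice can be written as a list comprehension over the two columns. This is only a sketch: the small df3 below is a stand-in for the real data, and it assumes Account is stored as strings such as "11.0", as the .str.len() call in the question implies.

```python
import pandas as pd

# Shortened stand-in for df3; Account is assumed to hold strings like "11.0"
df3 = pd.DataFrame({"Account": ["11.0", "111.0", "1111.0", "1115.0", "12.0", "121.0"]})

df3["Next_Account"] = df3["Account"].shift(-1)
df3["Len_account"] = df3["Account"].str.len() - 2  # do not count the ".0" suffix

# Slice each next-account string with the length from its own row.
# The last row has no next account (NaN), so it is left empty.
df3["current_digits_next"] = [
    nxt[:n] if isinstance(nxt, str) else None
    for nxt, n in zip(df3["Next_Account"], df3["Len_account"])
]
print(df3)
```

The isinstance check is there because shift(-1) leaves a missing value in the last row, which cannot be sliced like a string.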

CodePudding user response:

You can convert your Account field to a string and then use apply to check whether each account's digits appear in the next account:

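# Account codes as digit strings, and the next row's code aligned with each row
s1 = df['Account'].astype(int).astype(str)
s2 = s1.shift(-1)
# True where the current code occurs inside the next code, i.e. a group account
s3 = pd.concat([s1, s2], axis=1, ignore_index=True).loc[:len(s1), :].apply(lambda x: x[0] in x[1], axis=1)
# Rows left without a result are filled with False
df = pd.concat([df, s3], axis=1).fillna(False)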
print(df)
    Account    Amount      0
0      11.0   1000.82   True
1     111.0   1000.42   True
2    1111.0    791.51  False
3    1115.0   1802.19  False
4    1116.0    202.36  False
5    1117.0   1507.33  False
6    1118.0      0.03  False
7    1119.0      0.00  False
8    1120.0      0.00  False
9    1121.0     24.28  False
10   1122.0    376.87  False
11   1123.0      0.25  False
14     12.0  80179.92   True
15    121.0  80179.92   True
16  12101.0      0.00  False
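
Note that x[0] in x[1] is a substring test, so it would also match digits in the middle of the next code. A sketch of a stricter prefix check with str.startswith, which then keeps only the lowest-level accounts, might look like this. It assumes the rows are sorted so that a group account is directly followed by one of its members, as in the question. The names codes, next_codes, is_group and leaf_accounts are illustrative, and the frame is a shortened stand-in for the real data.

```python
import pandas as pd

# Shortened stand-in for the ledger in the question
df3 = pd.DataFrame({
    "Account": [11.0, 111.0, 1111.0, 1115.0, 12.0, 121.0, 12101.0],
    "Amount": [1000.82, 1000.42, 791.51, 1802.19, 80179.92, 80179.92, 0.0],
})

# Normalise codes to digit strings: 11.0 (or "11.0") becomes "11"
codes = df3["Account"].astype(float).astype(int).astype(str)
# Next row's code; the last row gets an empty string, which matches nothing
next_codes = codes.shift(-1, fill_value="")

# A row is treated as a group account when the next code starts with its code
is_group = pd.Series(
    [nxt.startswith(cur) for cur, nxt in zip(codes, next_codes)],
    index=df3.index,
)

# Keep only the lowest-level accounts
leaf_accounts = df3[~is_group]
print(leaf_accounts)
```

Summing Amount over leaf_accounts then avoids counting the group totals a second time.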