I have a pandas DataFrame with a few columns. I want to convert one of the string columns into an array of fixed-length strings.
Here is what the current table looks like:
----- -------------------- --------------------
|col1 | col2 | col3 |
----- -------------------- --------------------
| 1 |Marco | LITMATPHY |
| 2 |Lucy | NaN |
| 3 |Andy | CHMHISENGSTA |
| 4 |Nancy | COMFRNPSYGEO |
| 5 |Fred | BIOLIT |
----- -------------------- --------------------
How can I split each string in col3 into an array of 3-character strings? Blanks and NaN in col3 should be replaced with an empty array. The expected result:
----- -------------------- ----------------------------
|col1 | col2 | col3 |
----- -------------------- ----------------------------
| 1 |Marco | ['LIT','MAT','PHY'] |
| 2 |Lucy | [] |
| 3 |Andy | ['CHM','HIS','ENG','STA'] |
| 4 |Nancy | ['COM','FRN','PSY','GEO'] |
| 5 |Fred | ['BIO','LIT'] |
----- -------------------- ----------------------------
CodePudding user response:
Use textwrap.wrap:
import textwrap
import pandas as pd
# Split each string into 3-character pieces; NaN becomes an empty list
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])
If a string's length isn't a multiple of 3, the leftover characters end up in a shorter final element. If you only want strings of length 3, chain one more apply to drop that element:
# `x and` guards against the empty lists produced for NaN rows
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []) \
    .apply(lambda x: x[:-1] if x and len(x[-1]) != 3 else x)
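For reference, here is a self-contained sketch of the same idea that can be run as is. The helper name wrap_codes, its drop_partial flag, and the extra "ABCDE" row are made up for illustration; they are not part of the question's data.

```python
import textwrap

import numpy as np
import pandas as pd

# Sample data from the question, plus a hypothetical "ABCDE" row whose
# length is not a multiple of 3.
df = pd.DataFrame(
    {"col3": ["LITMATPHY", np.nan, "CHMHISENGSTA", "BIOLIT", "ABCDE"]}
)


def wrap_codes(value, width=3, drop_partial=False):
    """Hypothetical helper: split value into pieces of `width` characters."""
    if pd.isna(value):
        return []
    chunks = textwrap.wrap(value, width)
    # `chunks and ...` avoids indexing into an empty list.
    if drop_partial and chunks and len(chunks[-1]) != width:
        chunks = chunks[:-1]
    return chunks


all_chunks = df["col3"].apply(wrap_codes).tolist()
full_chunks = df["col3"].apply(lambda v: wrap_codes(v, drop_partial=True)).tolist()
```

Note that textwrap.wrap is whitespace-aware (it prefers to break at spaces and drops them), so this is only a good fit when the values contain no internal whitespace, as in the question's data.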
CodePudding user response:
Another way:
import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})
def split_str(s):
    # Collect consecutive 3-character slices of s
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i+3])
    return lst
df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))
# Output
           col3           col3_result
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2  CHMHISENGSTA  [CHM, HIS, ENG, STA]
3  COMFRNPSYGEO  [COM, FRN, PSY, GEO]
4        BIOLIT            [BIO, LIT]
CodePudding user response:
Using only pandas and plain Python slicing:
import numpy as np
import pandas as pd
df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])
def to_list(string, n):
    if string != string:  # True only for NaN, which is not equal to itself
        lst = []
    else:
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst
df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))
Output:
           col3              new_col3
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2                                  []
3     CHFDIOSFF       [CHF, DIO, SFF]
4        CHFIOD            [CHF, IOD]
5  FHDIFOSDFJKL  [FHD, IFO, SDF, JKL]
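To see the NaN self-comparison check and the slicing step in isolation, here is a small pure-Python sketch with no pandas involved. The name to_chunks is made up for this example, and treating None and whitespace-only strings as "blank" is an assumption about what the question means by blanks.

```python
def to_chunks(value, n=3):
    """Hypothetical helper: consecutive n-character pieces of value.

    NaN, None, and blank strings all give an empty list (an assumption
    about how "blanks" should be handled).
    """
    # NaN is not equal to itself, so `value != value` is True only for NaN.
    if value is None or value != value:
        return []
    text = str(value).strip()
    # Step through the string n characters at a time; the final slice is
    # shorter when len(text) is not a multiple of n.
    return [text[i:i + n] for i in range(0, len(text), n)]


nan = float("nan")
print(to_chunks("LITMATPHY"))  # ['LIT', 'MAT', 'PHY']
print(to_chunks(nan))          # []
print(to_chunks(""))           # []
print(to_chunks("BIOLI"))      # ['BIO', 'LI']
```

It can be applied to the column the same way as above, e.g. df['col3'].apply(to_chunks).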