Convert string column to array of fixed length strings in pandas dataframe

Time:09-26

I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of strings with fixed length.

Here is how the current table looks:

 ----- -------------------- -------------------- 
|col1 |         col2       |         col3       |
 ----- -------------------- -------------------- 
|   1 |Marco               | LITMATPHY          |
|   2 |Lucy                | NaN                |
|   3 |Andy                | CHMHISENGSTA       |
|   4 |Nancy               | COMFRNPSYGEO       |
|   5 |Fred                | BIOLIT             |
 ----- -------------------- -------------------- 

How can I split the strings in col3 into arrays of 3-character strings, as shown below? Note: col3 can contain blanks or NaN, and those should be replaced with an empty array.

 ----- -------------------- ---------------------------- 
|col1 |         col2       |         col3               |
 ----- -------------------- ---------------------------- 
|   1 |Marco               | ['LIT','MAT','PHY']        |
|   2 |Lucy                | []                         |
|   3 |Andy                | ['CHM','HIS','ENG','STA']  |
|   4 |Nancy               | ['COM','FRN','PSY','GEO']  |
|   5 |Fred                | ['BIO','LIT']              |
 ----- -------------------- ---------------------------- 

CodePudding user response:

Use textwrap.wrap:

import textwrap

import pandas as pd

# wrap() cuts each string into pieces of at most 3 characters; NaN becomes []
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])

If a string's length is not a multiple of 3, the leftover characters end up in a shorter final element. If you only want strings of length 3, add a second apply to drop that element:

# "x and" guards the empty lists produced for NaN rows, where x[-1] would raise IndexError
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []).\
           apply(lambda x: x[:-1] if x and len(x[-1]) % 3 != 0 else x)
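
As a possible alternative for the "complete chunks only" case, here is a minimal sketch using the pandas string accessor `Series.str.findall` with the regex `.{3}`. The pattern only matches full 3-character groups, so a shorter leftover is dropped without a second pass. The sample data and the name `chunks` are made up for illustration, and note that `.` does not match newline characters by default.

```python
import numpy as np
import pandas as pd

# Illustrative data: 'BIOLITX' has a leftover 'X' that should be dropped.
df = pd.DataFrame({'col3': ['LITMATPHY', np.nan, '', 'BIOLITX']})

# findall returns a list of matches for strings and leaves missing values
# as they are, so anything that is not a list is mapped to an empty list.
chunks = df['col3'].str.findall(r'.{3}').apply(
    lambda x: x if isinstance(x, list) else []
)
print(chunks.tolist())
```

Unlike the textwrap version, this does not treat whitespace specially, so it may be the safer choice if the codes can contain spaces.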

CodePudding user response:

Another way is to slice the string in a plain loop:

import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})

def split_str(s):
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i + 3])  # take the next 3 characters
    return lst

df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))

# Output

           col3           col3_result
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2  CHMHISENGSTA  [CHM, HIS, ENG, STA]
3  COMFRNPSYGEO  [COM, FRN, PSY, GEO]
4        BIOLIT            [BIO, LIT]

CodePudding user response:

Using only pandas (no extra modules), we can do:

import numpy as np
import pandas as pd

df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])

def to_list(string, n):
    if string != string:  # NaN is the only value that is not equal to itself
        lst = []
    else:
        # slice the string into consecutive pieces of n characters
        lst = [string[i:i + n] for i in range(0, len(string), n)]
    return lst

df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))

Output:

           col3              new_col3
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2                                  []
3     CHFDIOSFF       [CHF, DIO, SFF]
4        CHFIOD            [CHF, IOD]
5  FHDIFOSDFJKL  [FHD, IFO, SDF, JKL]
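
The question mentions "blanks" as well as NaN. If blank can also mean a whitespace-only string (that is an assumption, the question does not say), the function above would return pieces of spaces for it. A rough sketch of a variant that treats such values as empty follows; `to_chunks` is a hypothetical helper name, and the sample data is made up.

```python
import numpy as np
import pandas as pd


def to_chunks(value, n=3):
    """Split value into n-character pieces; NaN and blank strings give []."""
    # Anything that is not a string (NaN, None) or is only whitespace is empty.
    if not isinstance(value, str) or not value.strip():
        return []
    return [value[i:i + n] for i in range(0, len(value), n)]


df = pd.DataFrame({'col3': ['LITMATPHY', np.nan, '', '   ', 'BIOLIT']})
df['new_col3'] = df['col3'].apply(to_chunks)
print(df['new_col3'].tolist())
```

The isinstance check also covers None and other non-string values, which the `string != string` test does not.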