Split by character quantity, create new columns with substrings Python

Time:02-15

I found this example in another post, but it's for R; I need it for Python, in a pandas DataFrame.

Original post: "Split string every n characters new column"

Suppose I have a data frame like this, with a string column, var2:

var1 var2
1 abcdefghi
2 abcdefghijklmnop
3 abc
4 abcdefghijklmnopqrst

What is the most efficient way to split var2 every n characters into new columns, until the end of each string?

For example, splitting every 4 characters, the output would look like this:

var1  var2                  new_var1  new_var2  new_var3  new_var4  new_var5
1     abcdefghi             abcd      efgh      i
2     abcdefghijklmnop      abcd      efgh      ijkl      mnop
3     abc                   abc
4     abcdefghijklmnopqrst  abcd      efgh      ijkl      mnop      qrst

To make it more difficult, I don't know the length of the longest string in the column, so the number of new columns has to be however many the longest string requires.

CodePudding user response:

You can use:

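import pandas as pd

N = 4

# Custom function to split a string into pieces of N characters
split_string = lambda x: pd.Series([x[i:i + N] for i in range(0, len(x), N)])

# One column per piece; shorter strings are padded with empty strings
new_var = df['var2'].apply(split_string).fillna('')
# Rename columns 0, 1, 2, ... to new_var001, new_var002, new_var003, ...
new_var.columns = 'new_var' + (new_var.columns + 1).astype(str).str.zfill(3)

df = df.join(new_var)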

Output:

var1  var2                  new_var001  new_var002  new_var003  new_var004  new_var005
1     abcdefghi             abcd        efgh        i
2     abcdefghijklmnop      abcd        efgh        ijkl        mnop
3     abc                   abc
4     abcdefghijklmnopqrst  abcd        efgh        ijkl        mnop        qrst
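
For anyone who wants to run this end to end, here is a minimal sketch of the same apply-based idea. It assumes the sample data from the question; the function takes the chunk size as a parameter `n` (my naming, not part of the answer above), and the columns are named new_var1, new_var2, ... as in the question rather than zero-padded.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "var1": [1, 2, 3, 4],
    "var2": ["abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"],
})

def split_string(text, n):
    """Return the consecutive n-character pieces of text as a Series."""
    return pd.Series([text[i:i + n] for i in range(0, len(text), n)])

N = 4

# Series.apply forwards the keyword argument n to split_string; because the
# function returns a Series, the result is a DataFrame with one column per piece.
new_var = df["var2"].apply(split_string, n=N).fillna("")

# Rename the integer column labels 0, 1, 2, ... to new_var1, new_var2, ...
new_var.columns = [f"new_var{i + 1}" for i in new_var.columns]

result = df.join(new_var)
print(result)
```

The number of new columns follows from the longest string, so nothing has to be known in advance.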

CodePudding user response:

One option is to do all the preprocessing in vanilla Python before returning to pandas (it should be fast/efficient):

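# Split every string into 4-character pieces with a plain list comprehension
outcome = [[ent[n:n + 4] for n in range(0, len(ent), 4)] for ent in df.var2]
# Rows of unequal length are padded with None
outcome = pd.DataFrame(outcome)
outcome.columns = outcome.columns.map(lambda col: f"new_var{col + 1}")
pd.concat([df, outcome], axis="columns")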

   var1                  var2 new_var1 new_var2 new_var3 new_var4 new_var5
0     1             abcdefghi     abcd     efgh        i     None     None
1     2      abcdefghijklmnop     abcd     efgh     ijkl     mnop     None
2     3                   abc      abc     None     None     None     None
3     4  abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst
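
The output above has None where a string ran out of characters, while the question shows blanks. One way to get blanks, sketched here in plain Python under the assumption that empty strings are what is wanted, is to pad every row to the same width before handing the data back to pandas. The names `chunk`, `padded` and `names` are mine, for illustration only.

```python
def chunk(text, n):
    """Return the consecutive n-character pieces of text."""
    return [text[i:i + n] for i in range(0, len(text), n)]

# The var2 values from the question.
values = ["abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"]

rows = [chunk(value, 4) for value in values]

# The widest row decides how many new columns are needed.
width = max((len(row) for row in rows), default=0)

# Pad shorter rows with empty strings so every row has the same length.
padded = [row + [""] * (width - len(row)) for row in rows]
names = [f"new_var{i + 1}" for i in range(width)]

for row in padded:
    print(row)
```

The padded rows can then go into `pd.DataFrame(padded, columns=names)` and be concatenated with df as before. Calling `outcome.fillna("")` on the frame built above should give the same result with less code.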