I found this example in another post, but it's for R; I need it for Python, in a pandas DataFrame.
Original post = Split string every n characters new column
Suppose I have a data frame like this with a string vector, var2
var1 | var2 |
---|---|
1 | abcdefghi |
2 | abcdefghijklmnop |
3 | abc |
4 | abcdefghijklmnopqrst |
What is the most efficient way to split var2 every n characters into new columns, until the end of each string?
E.g., splitting every 4 characters, the output would look like this:
var1 | var2 | new_var1 | new_var2 | new_var3 | new_var4 | new_var5 |
---|---|---|---|---|---|---|
1 | abcdefghi | abcd | efgh | i | | |
2 | abcdefghijklmnop | abcd | efgh | ijkl | mnop | |
3 | abc | abc | | | | |
4 | abcdefghijklmnopqrst | abcd | efgh | ijkl | mnop | qrst |
To make it more difficult, I don't know the length of the longest string in the column, so the number of new columns has to be however many the longest string requires.
CodePudding user response:
You can use:
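import pandas as pd
N = 4
# Custom function to split a string into N-character chunks
split_string = lambda x: pd.Series([x[i:i + N] for i in range(0, len(x), N)])
# Shorter strings leave NaN in the trailing columns, so fill those with ''
new_var = df['var2'].apply(split_string).fillna('')
# Rename columns 0, 1, 2, ... to new_var001, new_var002, ...
new_var.columns = 'new_var' + (new_var.columns + 1).astype(str).str.zfill(3)
df = df.join(new_var)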
Output:
var1 | var2 | new_var001 | new_var002 | new_var003 | new_var004 | new_var005 |
---|---|---|---|---|---|---|
1 | abcdefghi | abcd | efgh | i | | |
2 | abcdefghijklmnop | abcd | efgh | ijkl | mnop | |
3 | abc | abc | | | | |
4 | abcdefghijklmnopqrst | abcd | efgh | ijkl | mnop | qrst |
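For anyone who wants to try this end to end, here is a self-contained sketch of the same approach. The sample frame is rebuilt from the question's table; the name `chunks` is just an illustrative stand-in for `new_var`, and `result` is used instead of overwriting `df`.

```python
import pandas as pd

# Sample frame copied from the table in the question
df = pd.DataFrame({
    "var1": [1, 2, 3, 4],
    "var2": ["abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"],
})

N = 4  # chunk width

# Slice each string into N-character pieces; apply() turns each returned
# Series into one row, so the column count follows the longest string
chunks = df["var2"].apply(
    lambda s: pd.Series([s[i:i + N] for i in range(0, len(s), N)])
)
chunks = chunks.fillna("")  # shorter strings leave NaN in trailing columns

# Rename 0, 1, 2, ... to new_var001, new_var002, ...
chunks.columns = "new_var" + (chunks.columns + 1).astype(str).str.zfill(3)

result = df.join(chunks)
print(result)
```

The zero-padding from `zfill(3)` keeps the new column names in order when sorted alphabetically, which is presumably why the names differ from the ones asked for in the question; drop that call if you want `new_var1`, `new_var2`, and so on.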
CodePudding user response:
One option is to do all the preprocessing in plain Python before returning to pandas (this should be fast, since it avoids a per-row apply):
# Build one list of 4-character chunks per string
outcome = [[ent[n:n + 4] for n in range(0, len(ent), 4)] for ent in df.var2]
# Rows of unequal length are padded with None
outcome = pd.DataFrame(outcome)
outcome.columns = outcome.columns.map(lambda col: f"new_var{col + 1}")
pd.concat([df, outcome], axis="columns")
var1 var2 new_var1 new_var2 new_var3 new_var4 new_var5
0 1 abcdefghi abcd efgh i None None
1 2 abcdefghijklmnop abcd efgh ijkl mnop None
2 3 abc abc None None None None
3 4 abcdefghijklmnopqrst abcd efgh ijkl mnop qrst
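Two things to watch with this version: the padding shows up as None rather than empty strings, and `pd.DataFrame(outcome)` gets a fresh 0..n-1 index, so `pd.concat` only lines the rows up if `df` also has the default index. A minimal sketch that handles both is below; the helper name `split_every_n` and its `prefix` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def split_every_n(frame, column, n, prefix="new_var"):
    """Hypothetical helper: append one column per n-character chunk of frame[column]."""
    rows = [[s[i:i + n] for i in range(0, len(s), n)] for s in frame[column]]
    # Reuse the caller's index so concat aligns rows even when it is not 0..len-1,
    # and replace the None padding with empty strings
    parts = pd.DataFrame(rows, index=frame.index).fillna("")
    parts.columns = [f"{prefix}{i + 1}" for i in range(parts.shape[1])]
    return pd.concat([frame, parts], axis="columns")


# Sample data from the question, with a deliberately non-default index
df = pd.DataFrame(
    {
        "var1": [1, 2, 3, 4],
        "var2": ["abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"],
    },
    index=[10, 20, 30, 40],
)

out = split_every_n(df, "var2", 4)
print(out)
```

This assumes every value in the column is a string; missing values (NaN) would need to be filled or filtered first, since `len()` fails on them.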