In a column in a Pandas Dataframe, I have strings like this:
column_name_1 | column_name_2 |
---|---|
a^b^c^d | j |
e^f^g | k^l |
h^i | m |
I need to split these strings into columns in the same data frame, like this
column_name_1 | column_name_2 | column_name_1_1 | column_name_1_2 | column_name_1_3 | column_name_2_1 | column_name_2_2 |
---|---|---|---|---|---|---|
a^b^c | j | a | b | c | j | |
e^f^g | k^l | e | f | g | k | l |
h^i | m | h | i | m |
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there is in the data.
My best effort either includes something like
df[["column_name_1_1","column_name_1_2 ","column_name_1_3"]] = df["column_name_1"].str.split('^',n=2, expand=True)
But it failes with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
CodePudding user response:
Let's try it with stack
str.split
unstack
join
.
The idea is to split each column by ^
and expand the split characters into a separate column. stack
helps us do a single str.split
on a Series object and unstack
creates a DataFrame with the same index as the original.
tmp = df.stack().str.split('^', expand=True).unstack(level=1).sort_index(level=1, axis=1)
tmp.columns = [f'{y}_{x 1}' for x, y in tmp.columns]
out = df.join(tmp).dropna(how='all', axis=1).fillna('')
Output:
column_name_1 column_name_2 column_name_1_1 column_name_1_2 column_name_1_3 column_name_1_4 column_name_2_1 column_name_2_2
0 a^b^c^d j a b c d j
1 e^f^g k^l e f g k l
2 h^i m h i m
CodePudding user response:
One-liner:
new_df = pd.concat([df] [pd.DataFrame([pd.Series(s) for s in df[col].str.split('^')]).add_prefix(c.name '_') for col in df], axis=1).fillna('')
Output:
>>> new_df
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 column_name_1_3 column_name_1_0 column_name_1_1
0 a^b^c^d j a b c d j
1 e^f^g k^l e f g k l
2 h^i m h i m