There's a dataframe below:
---------
| Value|
---------
|X1A14 |
|X20P79 |
|A50B7P60 |
|G24C5C6B8|
---------
The items in the Value column do not have a fixed length. For example, X1A14 consists of two words, X1 and A14, while A50B7P60 consists of three: A50, B7 and P60.
I want to split each item at every letter, but keep that letter as the start of the resulting word, like this:
--------- --- --- --- --
| Value| A| B| C| D|
--------- --- --- --- --
|X1A14 |X1 |A14| | |
|X20P79 |X20|P79| | |
|A50B7P60 |A50|B7 |P60| |
|G24C5C6B8|G24|C5 |C6 |B8|
--------- --- --- --- --
Finally, I want to add a mark column for every split column. I do not know in advance how many columns there will be; here the longest item is made up of four words, so there are four columns to mark in this case.
Below is the final output:
--------- --- ----- --- ----- --- ----- -- -----
| Value| A|mark1| B|mark2| C|mark3| D|mark4|
--------- --- ----- --- ----- --- ----- -- -----
|X1A14 |X1 | A|A14| B| | C| | D|
|X20P79 |X20| A|P79| B| | C| | D|
|A50B7P60 |A50| A|B7 | B|P60| C| | D|
|G24C5C6B8|G24| A|C5 | B|C6 | C|B8| D|
--------- --- ----- --- ----- --- ----- -- -----
I tried the split function, but it does not keep the delimiter (the leading letter) with each word.
CodePudding user response:
I suppose you are trying to match every substring that starts with an uppercase character and ends before the next uppercase character or the end of the string.
You can use extractall with the regular expression pattern ([A-Z][0-9]+) as follows:
import pandas as pd
# sample data
df = pd.DataFrame({
'value': ['X1A14','X20P79','A50B7P60','G24C5C6B8']
})
# extract
extractions = df['value'].str.extractall(r'([A-Z][0-9]+)')
# reshape
extractions['mark'] = extractions.index.get_level_values(1).values
extractions = extractions.rename(columns={0: 'value'}).unstack().swaplevel(axis=1).sort_index(axis=1)
extractions.columns = [col[0] if col[1]=='group' else col[1] + str(col[0]) for col in extractions.columns.values]
# append to original data
pd.concat([df, extractions], axis=1)
which results in
value mark0 value0 mark1 value1 mark2 value2 mark3 value3
0 X1A14 0.0 X1 1.0 A14 NaN NaN NaN NaN
1 X20P79 0.0 X20 1.0 P79 NaN NaN NaN NaN
2 A50B7P60 0.0 A50 1.0 B7 2.0 P60 NaN NaN
3 G24C5C6B8 0.0 G24 1.0 C5 2.0 C6 3.0 B8
This differs slightly from your expected result because it uses numeric identifiers instead of letters to identify each match. You did not specify whether the number of substrings in column value varies (given that this is only an excerpt of your data), so this may be more robust.
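If the letter marks from the question are needed, the match number produced by extractall can be mapped to a letter. The following is only a sketch under a few assumptions: the column is named Value as in the question, there are at most 26 words per item, and the names tokens, wide and out are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Value": ["X1A14", "X20P79", "A50B7P60", "G24C5C6B8"]})

# One row per match; the second index level ("match") numbers the words 0, 1, 2, ...
tokens = df["Value"].str.extractall(r"([A-Z][0-9]+)")[0]

# Turn the match number into columns and blank out the missing words.
wide = tokens.unstack("match").fillna("")

# Add the columns in the interleaved order A, mark1, B, mark2, ...
out = df.copy()
for i in wide.columns:
    letter = chr(ord("A") + int(i))  # 0 -> "A", 1 -> "B", ... (assumes <= 26 words)
    out[letter] = wide[i]
    out[f"mark{int(i) + 1}"] = letter

print(out)
```

Because the loop runs over whatever columns unstack produced, the number of word and mark columns follows the longest item in the data rather than being hard-coded.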
CodePudding user response:
You can use str.split with expand=True and the regex (?!^)(?=\D+) to create the columns, then build mark_cols, and finally concat.
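import pandas as pd
df = pd.DataFrame({"Value": ["X1A14", "X20P79", "A50B7P60", "G24C5C6B8"]})
# split before every non-digit character, except at the start of the string
t = df["Value"].str.split(r"(?!^)(?=\D+)", expand=True).fillna("")
mark_cols = ["mark" + str(x + 1) for x in t.columns]
t.columns = t.columns.map(lambda x: chr(ord("A") + x))
# one constant column per mark, repeated once per row
t[mark_cols] = pd.DataFrame(
    dict(zip(mark_cols, [[col] * len(t) for col in t.columns]))
)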
out = pd.concat([df, t], axis=1)
print(out)
Value A B C D mark1 mark2 mark3 mark4
0 X1A14 X1 A14 A B C D
1 X20P79 X20 P79 A B C D
2 A50B7P60 A50 B7 P60 A B C D
3 G24C5C6B8 G24 C5 C6 B8 A B C D
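To see why this split keeps the leading letter, it may help to try the same idea with the standard re module. This is a small sketch, assuming Python 3.7 or later (where re.split accepts zero-width matches); BOUNDARY and split_tokens are illustrative names, not part of pandas.

```python
import re

# Zero-width pattern: not at the start of the string, and the next
# character is a non-digit. Nothing is consumed, so no character is lost.
BOUNDARY = re.compile(r"(?!^)(?=\D)")

def split_tokens(value):
    """Split value in front of every non-digit character except the first."""
    return BOUNDARY.split(value)

for value in ["X1A14", "X20P79", "A50B7P60", "G24C5C6B8"]:
    print(value, split_tokens(value))
```

Note that every non-digit starts a new word, so an input such as AB12 would be split into A and B12.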