Splitting column by multiple custom delimiters in Python-CodePudding

I need to split a column called Creative where each cell contains samples such as:

pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)

Where each two-letter code preceding each bubbled section ( ) is the title of the desired column, and are the same in every row. The only data that changes is what is inside the bubbles. I want the data to look like:

pn	io	ta	pt	cn	cs
2021	302	Yes	Blue	John	Doe

I tried

 df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)

and

df['Creative Size'] = df['Creative Size'].str.replace(')','')

but got an error, error: missing ), unterminated subpattern at position 2, assuming it has something to do with regular expressions.

Is there an easy way to split these ? Thanks.

CodePudding user response：

Use extract with named capturing groups (see here):

import pandas as pd

# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])

# extract with a named capturing group
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d )\)io\((?P<io>\d )\)ta\((?P<ta>\w )\)pt\((?P<pt>\w )\)cn\((?P<cn>\w )\)cs\((?P<cs>\w )\)",
    expand=True)

print(res)

Output

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe

CodePudding user response：

I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then constructing a dataframe out of it. I can build it in one nested comprehension:

import re
rows = [{r[0]:r[1] for r in re.findall(r'(\w{2})\((. )\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values

Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!

David

CodePudding user response：

Try with extractall:

names = df["Creative"].str.extractall("(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall("\((.*?)\)").unstack()[0].set_axis(names, axis=1)

>>> output
     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe
1  2020  301   No   Red  Jane  Doe

Input df:

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)", 
                                "pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})

CodePudding user response：

We can use str.findall to extract matching column name-value pairs

pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w )\((\w )')))

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe

CodePudding user response：

Using regular expressions, different way of packaging final DataFrame:

import re
import pandas as pd

txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'

data = list(zip(*re.findall('([^\(] )\(([^\)] )\)', txt))
df = pd.DataFrame([data[1]], columns=data[0])