Extract specific words from dataframe-CodePudding

I have the following dataframe named marketing where i would like to extract out source= from the values. Is there a way to create a general regex function so that i can apply on other columns as well to extract words after equal sign?

Data      
source=book,social_media=facebook,ads=Facebook
source=book,ads=Facebook,customer=2
cost=2, customer=3

Im using python and i have tried the following:

df = pd.DataFrame()
def find_keywords(row_string):
    tags = [x for x in row_string if x.startswith('source=')]
    return tags
df['Data'] = marketing['Data'].apply(lambda row : find_keywords(row))

May i know whether there is a more efficient way to extract and place into columns:

source  social_media  ads  customer costs
book     facebook     facebook -    -
book      -           facebook 2    -

CodePudding user response：

You can split the column value of string type into dict then use pd.json_normalize to convert dict to columns.

out = pd.json_normalize(marketing['Data'].apply(lambda x: dict([map(str.strip, i.split('=')) for i in x.split(',')]))).dropna(subset='source')

print(out)

  source social_media       ads customer cost
0   book     facebook  Facebook      NaN  NaN
1   book          NaN  Facebook        2  NaN

CodePudding user response：

Here's another option:

Sample dataframe marketing is:

marketing = pd.DataFrame(
    {"Data": ["source=book,social_media=facebook,ads=Facebook",
              "source=book,ads=Facebook,customer=2",
              "cost=2, customer=3"]}
)

                                             Data
0  source=book,social_media=facebook,ads=Facebook
1             source=book,ads=Facebook,customer=2
2                              cost=2, customer=3

Now this

result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
          .str.split(r"\s*=\s*", expand=True).pivot(columns=0))

does produce

          1                                  
0       ads cost customer social_media source
0  Facebook  NaN      NaN     facebook   book
1  Facebook  NaN        2          NaN   book
2       NaN    2        3          NaN    NaN

which is almost what you're looking for, except for the extra column level and the column ordering. So the following modification

result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
          .str.split(r"\s*=\s*", expand=True).rename(columns={0: "columns"})
          .pivot(columns="columns").droplevel(level=0, axis=1))
result = result[["source", "social_media", "ads", "customer", "cost"]]

should fix that:

columns source social_media       ads customer cost
0         book     facebook  Facebook      NaN  NaN
1         book          NaN  Facebook        2  NaN
2          NaN          NaN       NaN        3    2