How to extract all specific values from a string in a row in pandas dataframe with regex?-CodePudding

I have specific values in a string in a Dataframe row, which is a complete mess. These are reviews scores scraped in the following format:

 data = {'split': ['5,   60% 4,   20% 3,   7% 2,   5% 1,   7% 4.2 out of 5']}
 df = pd.DataFrame(data)

I would need this format, meaning that I have to split above long string into these columns and corresponding values. Each rating (1-5) should have corresponding % assigned.

 data = {'5': ['60%'], '4': ['20%'],  '3': ['7%'],  '2': ['5%'], '1': ['7%'], 'rating': [4.2]}
 df = pd.DataFrame(data)

I was thinking about df['split'].str.extract(pat = '(\d %)') to extract all the numbers with % at first, but doesn't really work. It extracts only first number with %, not all from the string. Any ideas how to proceed?

CodePudding user response：

You can use

import pandas as pd
data = {'split': ['5,   60% 4,   20% 3,   7% 2,   5% 1,   7% 4.2 out of 5', '15,   20% 4,   40% 3,   5% 2,   2% 1,   37% 3.2 out of 5']}
df = pd.DataFrame(data)
df_out = df['split'].str.extractall(r'(\d %)')[0].unstack().reset_index(level=0, drop=True)
df_out['rating'] = df['split'].str.extract(r'(\d*\.?\d )\s out\s of\b', expand=False)

Output:

>>> df_out
match    0    1   2   3    4 rating
0      60%  20%  7%  5%   7%    4.2
1      20%  40%  5%  2%  37%    3.2

Notes:

df['split'].str.extractall(r'(\d %)')[0].unstack().reset_index(level=0, drop=True) extracts all \d % matches and puts the results into a dataframe with unstacked values and reset index
df['split'].str.extract(r'(\d*\.?\d )\s out\s of\b', expand=False) extracts the int or float number before an out of substring.