I have specific values in a string in a Dataframe row, which is a complete mess. These are reviews scores scraped in the following format:
data = {'split': ['5, 60% 4, 20% 3, 7% 2, 5% 1, 7% 4.2 out of 5']}
df = pd.DataFrame(data)
I would need this format, meaning that I have to split above long string into these columns and corresponding values. Each rating (1-5) should have corresponding % assigned.
data = {'5': ['60%'], '4': ['20%'], '3': ['7%'], '2': ['5%'], '1': ['7%'], 'rating': [4.2]}
df = pd.DataFrame(data)
I was thinking about df['split'].str.extract(pat = '(\d %)')
to extract all the numbers with % at first, but doesn't really work. It extracts only first number with %, not all from the string. Any ideas how to proceed?
CodePudding user response:
You can use
import pandas as pd
data = {'split': ['5, 60% 4, 20% 3, 7% 2, 5% 1, 7% 4.2 out of 5', '15, 20% 4, 40% 3, 5% 2, 2% 1, 37% 3.2 out of 5']}
df = pd.DataFrame(data)
df_out = df['split'].str.extractall(r'(\d %)')[0].unstack().reset_index(level=0, drop=True)
df_out['rating'] = df['split'].str.extract(r'(\d*\.?\d )\s out\s of\b', expand=False)
Output:
>>> df_out
match 0 1 2 3 4 rating
0 60% 20% 7% 5% 7% 4.2
1 20% 40% 5% 2% 37% 3.2
Notes:
df['split'].str.extractall(r'(\d %)')[0].unstack().reset_index(level=0, drop=True)
extracts all\d %
matches and puts the results into a dataframe with unstacked values and reset indexdf['split'].str.extract(r'(\d*\.?\d )\s out\s of\b', expand=False)
extracts the int or float number before anout of
substring.