Convert contents from a list to a dataframe

I have a list which looks something like this,

list = ['some random sentence','some random sentence 25% •Assignments for week1',
        'some random sentence','some random sentence 20% •Exam for week2','some random sentence',
        'some random sentence']

This is extracted from a PDF. I want to take only specific characters and words from specific values in this list and convert them into a pandas DataFrame, with one column for the percentage and one for the word that follows it.

The word 'Assignment' is just an example; there could be different words, but they always come after the percentage sign. There may be multiple spaces, or sometimes 1-2 special characters, between the percentage sign and the word. Is there a way to do this?

CodePudding user response：

With str.extract:

import pandas as pd

l = ['some random sentence','some random sentence 25% •Assignments for week1',
        'some random sentence','some random sentence 20% •Exam for week2','some random sentence',
        'some random sentence']

# Weight: digits followed by '%'; Object: first word after any non-word characters
out = (pd.Series(l)
         .str.extract(r'(?P<Weight>\d+%)\W*(?P<Object>\w+)')
         .dropna(subset=['Object'])  # drop the rows that did not match
       )

print(out)

Output:

  Weight       Object
1    25%  Assignments
3    20%         Exam
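
To see why this pattern copes with the "multiple spaces or 1-2 special characters" case from the question, here is a minimal sketch that runs the same regex through the standard `re` module. The `samples` strings are made-up variants of the question's data, not real PDF output.

```python
import re

# Same pattern as in the answer above: digits plus '%', then any run of
# non-word characters (spaces, bullets, dashes), then the first word.
PATTERN = re.compile(r'(?P<Weight>\d+%)\W*(?P<Object>\w+)')

# Hypothetical variants of the strings from the question.
samples = [
    'some random sentence',
    'some random sentence 25% •Assignments for week1',
    'some random sentence 20%    Exam for week2',
    'some random sentence 15% -•Quiz for week3',
]

# Keep one dict per string that matches; non-matching strings are skipped.
rows = [m.groupdict() for s in samples if (m := PATTERN.search(s))]
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should give the same two-column layout as `str.extract`, although the index will then be 0..n-1 rather than the original list positions.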

older answer

If you have a single term to match:

l = ['some random sentence','some random sentence 25% •Assignments for week1',
        'some random sentence','some random sentence 20% •Assignments for week2','some random sentence',
        'some random sentence']

s = pd.Series(l)
m = s.str.contains('assignment', case=False)

out = (s[m].str.extract(r'(?P<Weight>\d+%)')
       .assign(Object='Assignment')
       )

print(out)

Alternative with a regex to match any number of terms:

s = pd.Series(l)
out = (s.str.extractall(r'(?P<Object>Assignment|otherword)|(?P<Weight>\d+%)')
       .groupby(level=0).first()
       )

Output:

       Object Weight
1  Assignment    25%
3  Assignment    20%
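
If the terms come from a list rather than being typed into the pattern by hand, the alternation can be built programmatically. The following is only a sketch: `terms` and `first_groups` are hypothetical names, and the merging loop imitates `extractall(...).groupby(level=0).first()` with plain `re` for a single string.

```python
import re

# Hypothetical list of terms to look for; re.escape protects characters
# such as '+' that would otherwise be read as regex syntax.
terms = ['Assignment', 'Exam', 'C++ lab']
alternation = '|'.join(re.escape(t) for t in terms)
pattern = rf'(?P<Object>{alternation})|(?P<Weight>\d+%)'

def first_groups(text):
    """Scan all matches in one string and keep the first value per group."""
    found = {}
    for m in re.finditer(pattern, text):
        for name, value in m.groupdict().items():
            if value is not None:
                found.setdefault(name, value)
    return found

print(first_groups('some random sentence 20% •Exam for week2'))
```

The same `pattern` string can be passed to `s.str.extractall(pattern)` in place of the hand-written one.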

CodePudding user response：

I think the simplest approach is to use a regex:

import re
import pandas as pd

def regex_split(sentence):
    # Capture the digits before '%' and the word after the bullet
    match = re.search(r".+ (\d+)% •(\w+) for week\d", sentence)
    if match:
        return match.group(1) + '%', match.group(2)
    else:
        return None

# `list` is the list from the question (the name shadows the built-in)
df = pd.DataFrame({"sentence": list})
df["data"] = df["sentence"].apply(regex_split)
df = df[df["data"].notna()].copy()  # keep only the rows that matched
df["Weight"] = df["data"].apply(lambda x: x[0])
df["Object"] = df["data"].apply(lambda x: x[1])
df = df.drop(["sentence", "data"], axis=1)