Splitting and selecting unique rows using Pandas-CodePudding

I'm stuck in below problem. It would be great if you could help me with the code.

So I have this dataframe df1 which looks like this:

                                     Exams
0   exam7 (2017), exam9 (2018), exam3 (2018), exam...
1   exam2 (2017), exam2 (2017), exam8 (2018), exam...
2   exam7 (2017), exam6 (2017), exam2 (2017), exam...
3   exam10 (2019), exam4 (2019), exam4 (2019), exa...
4            exam4 (2019), exam4 (2019), exam4 (2019)
..                                                ...
95  exam10 (2019), exam4 (2019), exam6 (2017), exa...
96                         exam1 (2016), exam8 (2018)
97           exam3 (2018), exam5 (2020), exam6 (2017)
98  exam3 (2018), exam9 (2018), exam3 (2018), exam...
99                                       exam8 (2018)

[100 rows x 1 columns]

And I need to break it into two columns: Exam and Year. Also, I need only unique values for each exam and year pair.

Final Output should be like:

Exam,Year
exam1,2016
exam10,2019
exam2,2017
exam3,2018
exam4,2019
exam5,2020
exam6,2017
exam7,2017
exam8,2018
exam9,2018

I figured I need to iterate over each row using iterrows in Pandas and then use apply method to split exam and year and keep only unique values. But my code throws error:

def operations_exam(exam):
        for index, row in df1.iterrows():
            new=row[index].strip().split(' ')
            exams=new[0]
            year=new[1][1:-1]
    
    f = lambda row: row[index].strip().split(' ')
    for index, row in df1.iterrows():
        df1["Exams"] = df1["Exams"].apply(f,axis=1)

CodePudding user response：

You can do it in two lines:

s = df['Exams'].str.findall(r'(exam\d )\s*\((\d )\)').explode().drop_duplicates().reset_index()[0]
new_df = pd.DataFrame({'Exam': s.str[0], 'Year': s.str[1]})

Output:

>>> new_df
     Exam  Year
0   exam7  2017
1   exam9  2018
2   exam3  2018
3   exam2  2017
4   exam8  2018
5   exam6  2017
6  exam10  2019
7   exam4  2019
8   exam1  2016
9   exam5  2020

CodePudding user response：

It could be simpler if you use the built in functions:

df = pd.DataFrame({'Exams':['exam7 (2017), exam9 (2018), exam3 (2018)', 'exam7 (2017), exam2 (2017), exam8 (2018)']})

pairs = df['Exams'].str.split(', ').sum()
pairs = [p.split(' ') for p in pairs]
new_df = pd.DataFrame(pairs, columns = ['Exam', 'Year']).drop_duplicates(['Exam', 'Year'])
new_df['Year'] = new_df['Year'].str.extract('(\d )').astype('int')