I'm stuck in below problem. It would be great if you could help me with the code.
So I have this dataframe df1 which looks like this:
Exams
0 exam7 (2017), exam9 (2018), exam3 (2018), exam...
1 exam2 (2017), exam2 (2017), exam8 (2018), exam...
2 exam7 (2017), exam6 (2017), exam2 (2017), exam...
3 exam10 (2019), exam4 (2019), exam4 (2019), exa...
4 exam4 (2019), exam4 (2019), exam4 (2019)
.. ...
95 exam10 (2019), exam4 (2019), exam6 (2017), exa...
96 exam1 (2016), exam8 (2018)
97 exam3 (2018), exam5 (2020), exam6 (2017)
98 exam3 (2018), exam9 (2018), exam3 (2018), exam...
99 exam8 (2018)
[100 rows x 1 columns]
And I need to break it into two columns: Exam and Year. Also, I need only unique values for each exam and year pair.
Final Output should be like:
Exam,Year
exam1,2016
exam10,2019
exam2,2017
exam3,2018
exam4,2019
exam5,2020
exam6,2017
exam7,2017
exam8,2018
exam9,2018
I figured I need to iterate over each row using iterrows in Pandas and then use apply method to split exam and year and keep only unique values. But my code throws error:
def operations_exam(exam):
for index, row in df1.iterrows():
new=row[index].strip().split(' ')
exams=new[0]
year=new[1][1:-1]
f = lambda row: row[index].strip().split(' ')
for index, row in df1.iterrows():
df1["Exams"] = df1["Exams"].apply(f,axis=1)
CodePudding user response:
You can do it in two lines:
s = df['Exams'].str.findall(r'(exam\d )\s*\((\d )\)').explode().drop_duplicates().reset_index()[0]
new_df = pd.DataFrame({'Exam': s.str[0], 'Year': s.str[1]})
Output:
>>> new_df
Exam Year
0 exam7 2017
1 exam9 2018
2 exam3 2018
3 exam2 2017
4 exam8 2018
5 exam6 2017
6 exam10 2019
7 exam4 2019
8 exam1 2016
9 exam5 2020
CodePudding user response:
It could be simpler if you use the built in functions:
df = pd.DataFrame({'Exams':['exam7 (2017), exam9 (2018), exam3 (2018)', 'exam7 (2017), exam2 (2017), exam8 (2018)']})
pairs = df['Exams'].str.split(', ').sum()
pairs = [p.split(' ') for p in pairs]
new_df = pd.DataFrame(pairs, columns = ['Exam', 'Year']).drop_duplicates(['Exam', 'Year'])
new_df['Year'] = new_df['Year'].str.extract('(\d )').astype('int')