I have a non-normalised .db file and I need to create a dataframe df_exams from its 'Exams' column. The original file has the columns Student ID and Exams, like this:
Student ID | Exams |
---|---|
1 | exam7 (2017), exam9 (2018), exam3 (2018),... |
2 | exam2(2017), exam2(2017), exam8 (2018),... |
3 | exam7 (2017), exam9 (2018), exam3 (2018),... |
And I need it like this:
Student ID | Exam | Year |
---|---|---|
1 | exam7 | 2017 |
1 | exam9 | 2018 |
1 | exam3 | 2018 |
and so on. I am fairly new to Python and would appreciate the help.
I wrote this code:
df_exams[['Exams','Year']]= df_exams.Exams.str.extract('(.)\s\((.\d+)', expand=True)
This does not produce the desired output.
CodePudding user response:
I recreated your dataframe using:
import pandas as pd
my_df = pd.DataFrame({'Student ID': [1, 2, 3], "Exams": ["exam7 (2017), exam9 (2018), exam3 (2018)"]*3})
Then I split your Exams column:
my_df["Exams"] = my_df["Exams"].str.split(",")
This puts a list of exams in each cell of the Exams column.
Then I explode the column to make multiple rows:
my_df = my_df.explode("Exams")
which gives:
   Student ID         Exams
0           1  exam7 (2017)
0           1  exam9 (2018)
0           1  exam3 (2018)
1           2  exam7 (2017)
1           2  exam9 (2018)
1           2  exam3 (2018)
2           3  exam7 (2017)
2           3  exam9 (2018)
2           3  exam3 (2018)
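Note that explode repeats the original index label for every row it creates, and that splitting on "," alone leaves a leading space on every entry after the first. If a fresh index and clean strings are preferred, something like the following sketch may help. It assumes pandas 1.1 or later (for the ignore_index argument), and the names demo and flat are made up for illustration:

```python
import pandas as pd

# Small stand-in for the real data (hypothetical values).
demo = pd.DataFrame({
    "Student ID": [1, 2],
    "Exams": ["exam7 (2017), exam9 (2018)", "exam2 (2017)"],
})

# Split on the comma, then explode with a fresh 0..n-1 index
# instead of repeating the original row labels.
demo["Exams"] = demo["Exams"].str.split(",")
flat = demo.explode("Exams", ignore_index=True)

# Remove the leading space left behind by splitting on "," alone.
flat["Exams"] = flat["Exams"].str.strip()
print(flat)
```

On older pandas versions, calling reset_index(drop=True) after explode should give the same fresh index.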
Now I extract the values required using regex:
my_df["Year"] = my_df["Exams"].str.extract(r"\((\d{4})\)")
my_df["Exam"] = my_df["Exams"].str.extract(r"(exam\d{1,2})")
Which gives me a dataframe like this:
   Student ID         Exams  Year   Exam
0           1  exam7 (2017)  2017  exam7
0           1  exam9 (2018)  2018  exam9
0           1  exam3 (2018)  2018  exam3
1           2  exam7 (2017)  2017  exam7
1           2  exam9 (2018)  2018  exam9
1           2  exam3 (2018)  2018  exam3
2           3  exam7 (2017)  2017  exam7
2           3  exam9 (2018)  2018  exam9
2           3  exam3 (2018)  2018  exam3
You can then select just the columns you want:
my_cols = my_df[["Student ID", "Exam", "Year"]]
and your dataframe will look like the one you posted.
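Putting the steps above together, here is a minimal end-to-end sketch. It is a variation rather than exactly the code above: it pulls both fields out in a single extract call with named groups, resets the index after exploding, and converts Year to int, which assumes every entry actually matches the pattern. The name parts is only illustrative:

```python
import pandas as pd

my_df = pd.DataFrame({
    "Student ID": [1, 2, 3],
    "Exams": ["exam7 (2017), exam9 (2018), exam3 (2018)"] * 3,
})

# One row per exam entry, with a clean 0..n-1 index.
my_df["Exams"] = my_df["Exams"].str.split(",")
my_df = my_df.explode("Exams").reset_index(drop=True)

# Named groups become the column names; \s* also tolerates
# entries written without a space, such as "exam2(2017)".
parts = my_df["Exams"].str.extract(r"(?P<Exam>exam\d+)\s*\((?P<Year>\d{4})\)")

# Attach the extracted columns next to the student IDs.
df_exams = my_df[["Student ID"]].join(parts)

# Assumes every row matched; unmatched rows would be NaN and make astype(int) fail.
df_exams["Year"] = df_exams["Year"].astype(int)
print(df_exams)
```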
CodePudding user response:
With your data here:
import pandas as pd
test_data = [
    [1, "exam7 (2017), exam9 (2018), exam3 (2018)"],
    [2, "exam2(2017), exam2(2017), exam8 (2018)"],
    [3, "exam7 (2017), exam9 (2018), exam3 (2018)"]
]
df = pd.DataFrame(test_data, columns=["Student ID", "Exams"])
You can do it like this:
result = df.Exams.str.extractall(
    r"([^,]*?)\s*\((\d+)\)"
).reset_index(
    level=["match"], drop=True
).reset_index().rename(
    columns={
        0: "Exam",
        1: "Year",
        "index": "Student ID"
    }
)
And the result is:
   Student ID   Exam  Year
0           0  exam7  2017
1           0  exam9  2018
2           0  exam3  2018
3           1  exam2  2017
4           1  exam2  2017
5           1  exam8  2018
6           2  exam7  2017
7           2  exam9  2018
8           2  exam3  2018
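One thing to watch in the output above: the Student ID column holds the dataframe's positional index (0, 1, 2), not the original IDs (1, 2, 3). A possible variation, offered as a sketch rather than a definitive fix, is to make Student ID the index before calling extractall so that the real IDs carry through. The named groups and the \w+ pattern are my own substitutions; \w+ should also keep the space that follows each comma out of the Exam value:

```python
import pandas as pd

df = pd.DataFrame(
    [
        [1, "exam7 (2017), exam9 (2018), exam3 (2018)"],
        [2, "exam2(2017), exam2(2017), exam8 (2018)"],
        [3, "exam7 (2017), exam9 (2018), exam3 (2018)"],
    ],
    columns=["Student ID", "Exams"],
)

result = (
    df.set_index("Student ID")["Exams"]
    # One output row per match; named groups become the column names.
    .str.extractall(r"(?P<Exam>\w+)\s*\((?P<Year>\d+)\)")
    # Drop the per-row match counter, keep Student ID as the index...
    .reset_index(level="match", drop=True)
    # ...then turn Student ID back into an ordinary column.
    .reset_index()
)
print(result)
```

Year comes back as strings here; result["Year"].astype(int) converts it if numeric years are needed.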