Home > Software design >  I have a non normalised.db file and I need to create a dataframe df_exams using Pandas
I have a non normalised.db file and I need to create a dataframe df_exams using Pandas

Time:12-19

I have a non normalised.db file and I need to create a dataframe df_exams from the column 'Exams' of the non-normalised db file. Now the column exams of the non-normalised db file looks like this:

The original non-normalised file has columns of Student ID and Exams like this:

Student ID Exams
1 exam7 (2017), exam9 (2018), exam3 (2018),...
2 exam2(2017), exam2(2017), exam8 (2018),...
3 exam7 (2017), exam9 (2018), exam3 (2018),...

And I need it like

Student ID Exam Year
1 exam7 2017
1 exam9 2018
1 exam3 2018

and so on. I am fairly new to python and appreciate the help.

I had written this code:

df_exams[['Exams','Year']]= df_exams.Exams.str.extract('(.)\s\((.\d )', expand=True)

This does not produce the desired output.

CodePudding user response:

I recreated your dataframe using

import pandas as pd

my_df = pd.DataFrame({'Student ID': [1, 2, 3], "Exams": ["exam7 (2017), exam9 (2018), exam3 (2018)"]*3})

then I split your Exams column

my_df["Exams"] =my_df["Exams"].str.split(",")

this puts list of exams in Exams column.

Then I explode the column to make multiple rows:

my_df = my_df.explode("Exams")

which gives:

   Student ID          Exams
0           1   exam7 (2017)
0           1   exam9 (2018)
0           1   exam3 (2018)
1           2   exam7 (2017)
1           2   exam9 (2018)
1           2   exam3 (2018)
2           3   exam7 (2017)
2           3   exam9 (2018)
2           3   exam3 (2018)

Now I extract the values required using regex:

my_df["Year"] = my_df["Exams"].str.extract(r"\((\d{4})\)")
my_df["Exam"] = my_df["Exams"].str.extract(r"(exam\d{1,2})")

Which gives me a dataframe like this:

   Student ID          Exams  Year   Exam
0           1   exam7 (2017)  2017  exam7
0           1   exam9 (2018)  2018  exam9
0           1   exam3 (2018)  2018  exam3
1           2   exam7 (2017)  2017  exam7
1           2   exam9 (2018)  2018  exam9
1           2   exam3 (2018)  2018  exam3
2           3   exam7 (2017)  2017  exam7
2           3   exam9 (2018)  2018  exam9
2           3   exam3 (2018)  2018  exam3

You can filter the columns you want like:

my_cols = my_df[["Stzudent ID", "Exam", "Year"]]

and your dataframe will look like what you have posted here.

CodePudding user response:

With your data here:

import pandas as pd
test_data =[
    [1,"exam7 (2017), exam9 (2018), exam3 (2018)"],
    [2,"exam2(2017), exam2(2017), exam8 (2018)"],
    [3,"exam7 (2017), exam9 (2018), exam3 (2018)"]
]
df = pd.DataFrame(test_data,columns=["Student ID","Exams"])

You can do it like this:

result = df.Exams.str.extractall(
    r"([^,]*?)\s*\((\d )\)"
).reset_index(
    level=["match"],drop=True
).reset_index().rename(
    columns={
        0:"Exam",
        1:"Year",
        "index":"Student ID"
    }
)

And the result is:

    Student ID  Exam    Year
0   0   exam7   2017
1   0   exam9   2018
2   0   exam3   2018
3   1   exam2   2017
4   1   exam2   2017
5   1   exam8   2018
6   2   exam7   2017
7   2   exam9   2018
8   2   exam3   2018
  • Related