I got stuck trying to clean a dataframe similar to this one:
code | course name | EOS | Mid | test |
---|---|---|---|---|
AA101 | Course 1 | 350 | 420 | NaN |
AA102 | Course 2 | 400 | 470 | NaN |
AB101 | Course 3 | #560 | 570 | NaN |
AB102 | Course 4 | 410 | 465 | NaN |
AC101 | Course 5 | # | 522 | NaN |
I need to keep only numerical values in the column EOS
and move #
characters that appear in it to the column test
, to indicate that an additional test is required for that course. This is because some of the values have a #
before the actual number, such as Course 3
, and some have only the #
as the value, such as Course 5
.
The dataframe was created using Camelot to extract those values from a PDF table.
What I need is to take this #
out of this column and add it to the test
column instead.
Is there an easy way to do that?
CodePudding user response:
There is no builtin function to do just this, but it can be done using two lines:
df.loc[df['EOS'].str.contains('#', na=False) 'test'] = '#'
df['EOS'] = df['EOS'].str.replace('#', '')
Output:
>>> df
code course name EOS Mid test
0 AA101 Course 1 350 420 NaN
1 AA102 Course 2 400 470 NaN
2 AB101 Course 3 560 570 #
3 AB102 Course 4 410 465 NaN
4 AC101 Course 5 522
CodePudding user response:
You can do this
data = pd.DataFrame(
[ ["AA101", "Course 1", "350", 420, None],
["AA102", "Course 2", "400", 470, None],
["AB101", "Course 3", "#560", 570, None],
["AB102", "Course 4", "410", 465, None],
["AC101", "Course 5", "#", 522, None] ],
columns = ["code", "course name", "EOS", "Mid", "test"]
)
transformed_data = (
data
.assign(test=lambda d: d["EOS"].apply(lambda d_: "#" if "#" in d_ else None))
.assign(EOS=lambda d: d["EOS"].replace({"#": ""}, regex=True))
)