I have large csv data file containing few columns with mixed data type, Integer and String with mathematical operators. for example.
column1 column2
70 90
60 3 50
80 30-4
90 5 40 7
45-4 60
50 5 56
50-5 32 20
Is there anyway in pandas dataframe that I can read column values as Integer only? I want to read string values as output of operation.
Desired output should be,
column1 column2
70 90
63 50
80 26
95 47
41 60
55 56
45 52
Thanks in advance! DP
CodePudding user response:
(To expand on my comment...)
You will need to evaluate the expressions.
While eval()
is be able to do that, it's also able to evaluate any arbitrary Python code, including things that will steal your secrets, destroy your system, etc.
Thus, you'll need to make sure there's nothing untoward in there; below, I'm using a regexp to ensure all characters in the string are either digits or one of the fundamental mathematical operators. (The eval()
call can still fail due to invalid syntax, e.g. 3*/3
, or it can take a long time or result in a big, big value (333**999
).)
The data
here is self-contained for example's sake, but naturally you can read it in from a file or whatnot.
import io
import re
import pandas as pd
CSV_DATA = """
column1;column2
70;90
60 3;50
80;30-4
90 5;40 7
45-4;60
50 5;56
50-5;32 20
""".strip()
safe_re = re.compile("^[- */0-9] $")
data = pd.read_csv(io.StringIO(CSV_DATA), sep=";")
def maybe_evaluate(val):
if isinstance(val, str):
if val.isdigit():
# already only numbers, coerce to integer
return int(val)
if safe_re.match(val):
# should be safe to evaluate (only numbers and operators)
return eval(val)
# Otherwise, return the value as-is (you could raise an error here)
return val
data["column1"] = data["column1"].apply(maybe_evaluate)
data["column2"] = data["column2"].apply(maybe_evaluate)
print(data)
This prints out
column1 column2
0 70 90
1 63 50
2 80 26
3 95 47
4 41 60
5 55 56
6 45 52
as expected.
CodePudding user response:
Instead of implementing safe eval
ourselves we can also take advantage that pandas
already provides such implementation:
import pandas as pd
df = pd.DataFrame({
'column1': ['70', '60 3', '80', '90 5', '45-4', '50 5', '50-5'],
'column2': ['90', '50', '30-4', '40 7', '60', '56', '32 20']
})
for col in df.columns:
df[col] = pd.eval(df[col])
Or with apply
:
df = df.apply(pd.eval)
Either produce df
:
column1 column2
0 70 90
1 63 50
2 80 26
3 95 47
4 41 60
5 55 56
6 45 52
pandas.eval
will:
Evaluate a Python expression as a string using various backends.
The following arithmetic operations are supported: , -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
The "string cannot contain any Python statements, only Python expressions."