how to handle mixed datatype-CodePudding

I have large csv data file containing few columns with mixed data type, Integer and String with mathematical operators. for example.

column1  column2
70        90
60 3      50
80        30-4
90 5      40 7
45-4      60
50 5      56
50-5      32 20

Is there anyway in pandas dataframe that I can read column values as Integer only? I want to read string values as output of operation.

Desired output should be,

column1   column2
70          90
63          50
80          26
95          47
41          60
55          56
45          52

Thanks in advance! DP

CodePudding user response：

(To expand on my comment...)

You will need to evaluate the expressions.

While eval() is be able to do that, it's also able to evaluate any arbitrary Python code, including things that will steal your secrets, destroy your system, etc.

Thus, you'll need to make sure there's nothing untoward in there; below, I'm using a regexp to ensure all characters in the string are either digits or one of the fundamental mathematical operators. (The eval() call can still fail due to invalid syntax, e.g. 3*/3, or it can take a long time or result in a big, big value (333**999).)

The data here is self-contained for example's sake, but naturally you can read it in from a file or whatnot.

import io
import re

import pandas as pd

CSV_DATA = """
column1;column2
70;90
60 3;50
80;30-4
90 5;40 7
45-4;60
50 5;56
50-5;32 20
""".strip()

safe_re = re.compile("^[- */0-9] $")

data = pd.read_csv(io.StringIO(CSV_DATA), sep=";")


def maybe_evaluate(val):
    if isinstance(val, str):
        if val.isdigit():
            # already only numbers, coerce to integer
            return int(val)
        if safe_re.match(val):
            # should be safe to evaluate (only numbers and operators)
            return eval(val)
    # Otherwise, return the value as-is (you could raise an error here)
    return val


data["column1"] = data["column1"].apply(maybe_evaluate)
data["column2"] = data["column2"].apply(maybe_evaluate)

print(data)

This prints out

   column1  column2
0       70       90
1       63       50
2       80       26
3       95       47
4       41       60
5       55       56
6       45       52

as expected.

CodePudding user response：

Instead of implementing safe eval ourselves we can also take advantage that pandas already provides such implementation:

import pandas as pd

df = pd.DataFrame({
    'column1': ['70', '60 3', '80', '90 5', '45-4', '50 5', '50-5'],
    'column2': ['90', '50', '30-4', '40 7', '60', '56', '32 20']
})

for col in df.columns:
    df[col] = pd.eval(df[col])

Or with apply:

df = df.apply(pd.eval)

Either produce df:

  column1 column2
0      70      90
1      63      50
2      80      26
3      95      47
4      41      60
5      55      56
6      45      52

pandas.eval will:

Evaluate a Python expression as a string using various backends.

The following arithmetic operations are supported: , -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.

The "string cannot contain any Python statements, only Python expressions."