Verifying the data format in columns for a pandas dataframe

Time:08-05

I have a dataframe of strings representing numbers (integers and floats).

I want to implement a validation to make sure the strings in certain columns only represent integers.

Here is a dataframe with two columns, headed 'str as ints' and 'str as double', which hold integers and floats in string format.

# Import pandas library
import pandas as pd

# initialize list elements
data = ['10','20','30','40','50','60']

# Create the pandas DataFrame with an explicitly provided column name
df = pd.DataFrame(data, columns=['str as ints'])
df['str as double'] = ['10.0', '20.0', '30.0', '40.0', '50.0', '60.0']

Here is a function I wrote that checks for a decimal point (the radix point) in the string to determine whether it represents an integer or a float.

def includes_dot(s):
    return '.' in s

Can I use the apply function on this dataframe directly, or do I need to write another function that takes the dataframe and a list of column headers and then calls includes_dot, like this:

def check_df(df, lst):
    for val in lst:
        apply(df[val]...?)
    # then print out the results if certain columns fail the check

Or is there a better way to approach this problem altogether?

The expected output is a list of the column headers that fail the criterion. For example, given the list ['str as ints', 'str as double'], str as double should be printed because that column does not contain only integers.

CodePudding user response:

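# Report every column in which at least one string contains a literal '.'
for col in df:
    # regex=False matches '.' literally instead of as a regex wildcard
    if df[col].str.contains('.', regex=False).any():
        print(col, "contains a '.'")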
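
The question asks for a list of failing column headers rather than printed messages, so the loop above could be wrapped in a function along the lines of the check_df outlined in the question. The following is only a sketch under a few assumptions: the pattern `[+-]?[0-9]+` is one possible definition of "represents an integer" (stricter than looking for a dot, since it also rejects strings such as '1e5' or 'abc'), missing values are treated as failures, and `Series.str.fullmatch` requires pandas 1.1 or later.

```python
import pandas as pd


def check_df(df, columns):
    """Return the names of the listed columns that hold any non-integer string."""
    failed = []
    for col in columns:
        # fullmatch requires the whole string to be an optionally signed run of digits;
        # na=False makes missing values count as failing the check (an assumption)
        is_int = df[col].str.fullmatch(r'[+-]?[0-9]+', na=False)
        if not is_int.all():
            failed.append(col)
    return failed


# Same sample data as in the question
df = pd.DataFrame({
    'str as ints': ['10', '20', '30', '40', '50', '60'],
    'str as double': ['10.0', '20.0', '30.0', '40.0', '50.0', '60.0'],
})

print(check_df(df, ['str as ints', 'str as double']))  # ['str as double']
```

If the includes_dot function from the question should be reused as-is, `df[col].apply(includes_dot).any()` gives the same per-column answer as the `str.contains` check, at the cost of calling a Python function once per row.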