How do I find irregular values in columns in a data-frame that have a huge number of unique values?


Below is a sample of two columns in a data-frame containing data about user reviews for various Google Play Store apps.

Last Updated       Current Version
January 7, 2018    1.0.0
1.0.19             1.2.1
March 17, 2018     Varies with device

In these columns I want to find any anomalous/irregular values during data cleaning (such as '1.0.19' in the 'Last Updated' column and 'Varies with device' in the 'Current Version' column, as seen in the table above). However, these columns have 1378 and 2832 unique values respectively. How do I scan through these values and find the anomalies in the quickest/most efficient way possible, without having to go through each unique value in the huge list?

CodePudding user response:

You can try something like this:

import pandas as pd

df = pd.read_csv('my_file.csv')

def time_search(x):
    # Try to parse the value as a date; anything that cannot be parsed
    # is reported as an anomaly and replaced with a missing value.
    try:
        return pd.to_datetime(x)
    except (ValueError, TypeError):
        print("found strange value:", x)
        return pd.NA

df['Last Updated'] = df['Last Updated'].apply(time_search)

Output:

found strange value: 1.0.19

Then it should be easy to drop the NaN values, for example (see the sketch below).
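A minimal sketch continuing from the snippet above:

# Rows where parsing failed now hold a missing value; inspect them, then drop them.
print(df[df['Last Updated'].isna()])
df = df.dropna(subset=['Last Updated'])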

For the version column, it is easy to check whether each value is valid or not:

df["Current Ver"].str.contains('^[0-9].([0-9].)*')

I suggest exploring these ideas for the rest of the columns.
