I have a big dataset with alot of columns, being most of them not numeric values. I need to find inconsistencies in the data as well as outliers and the part of obtaining inconsistencies would be easy if the dataset wasn't so big (7032 rows to be exact).
An inconsistency would be something like: ID supposed to be 4 letters and 4 numbers and I obtain something else (like 3 numbers and 2 letters); or other example would be a number that should be a 0 or 1 and I obtain a -1 or a 2 .
Is there any function that I can use to obtain the inconsitencies in each column? For the specific columns that doesn't have numeric values, I thought of doing a regex and validate if each row for a certain column is valid but I didn't found info that could give me that.
For the part of outliers I did a boxplot to see if I could obtain any outlier, like this:
boxplot(dataset$column)
But the graphic didn't gave me any outliers. Should I be ok with the results that I obtain in the graphic or should I try something else to see if there is really any outlier in the data?
CodePudding user response:
For the specific examples you've given:
- an ID must be be four numbers and 4 letters:
!grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)
will be TRUE for inconsistent values (^
and $
mean beginning- and end-of-string respectively; {4}
means "previous pattern repeats exactly four times"; [0-9]
means "any symbol between 0 and 9 (i.e. any numeral); [[:alpha:]]
means "any alphabetic character"). If you only want uppercase letters you could use [A-Z]
instead (assuming you are not working in some weird locale like Estonian).
If you need a numeric value to be 0 or 1, then
!num_val %in% c(0,1)
will work (this will work for any set of allowed values; you can use it for a specific set of allowed character values as well)If you need a numeric value to be between
a
andb
then!(a < num_val & num_val < b)
...