Cleaning data, deleting rows with any text in certain columns

I'm trying to do some data cleaning on some huge datasets, and I'm new to Python (I've used Google to search for my problem), so please bear with me and my terminology.

The data is imported from a CSV into a pandas.core.frame.DataFrame. Some of my columns should only contain numbers and others only text:

        CPRNUM           REQ_SAMPLETIME     SAMPLE_ID   RESULT      

0       1234567890       2014-05-30 07:59   50226686    0.7409090909090907  
1       The sample was.. 2013-09-04 07:45   47721186    0.8290909090909093  
2       1234567890       The sample was..   46473016    1.0918181818181818  

I would really like to get rid of the rows where CPRNUM is not a 10-digit number or where REQ_SAMPLETIME contains text, so it would look like:

        CPRNUM           REQ_SAMPLETIME     SAMPLE_ID   RESULT      

0       1234567890       2014-05-30 07:59   50226686    0.7409090909090907  
3       0987654321       2018-06-10 05:32   12354678    3.7290909090909093  
4       1234567890       2013-09-04 07:45   15672687    5.9999951818181818  

Thanks for the help

CodePudding user response:

One way to solve this is shown below:

import pandas as pd
import re

df = pd.read_csv('data.csv')

# This function checks whether CPRNUM is a valid 10-digit number
def isCPRNUMvalid(val):
    if len(val) == 10:  # the string has a length of 10
        if val.isnumeric():  # the string is a pure number
            return True
    return False

# This function checks whether the datetime is valid
# You can modify this to extract the date-time from the string if present
def isREQSAMPLETIMEvalid(st):
    pattern = r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}'  # regex to match, e.g. 2014-05-30 07:59
    if re.match(pattern, st):
        return True
    return False


df['validCPRNUM'] = df['CPRNUM'].apply(isCPRNUMvalid)
df['validREQ_SAMPLETIME'] = df['REQ_SAMPLETIME'].apply(isREQSAMPLETIMEvalid)

df = df[df.validCPRNUM]  # keep rows where validCPRNUM is True
df = df[df.validREQ_SAMPLETIME]  # keep rows where validREQ_SAMPLETIME is True
print(df)
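
If you do not want to keep the two helper columns in the cleaned frame, you can drop them after filtering (an optional extra step, not part of the snippet above):

df = df.drop(columns=['validCPRNUM', 'validREQ_SAMPLETIME'])
print(df)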

CodePudding user response:

Assuming your dataframe is stored in the df variable, you can try:

df = df[(df["CPRNUM"].str.isdigit()) & (df["CPRNUM"].str.len() == 10) & (~df["REQ_SAMPLETIME"].str.contains("[A-Za-z]"))]
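The same filter can also be built from named boolean masks, which can be easier to read and debug. This is just a sketch rewording the one-liner above, with na=False added so that rows with a missing REQ_SAMPLETIME are treated as invalid rather than raising an error (an assumption about how you want to handle missing values):

# df is the DataFrame loaded from the CSV, with string (object) columns
is_ten_digit_number = df["CPRNUM"].str.isdigit() & (df["CPRNUM"].str.len() == 10)
has_no_letters = ~df["REQ_SAMPLETIME"].str.contains("[A-Za-z]", na=False)

df = df[is_ten_digit_number & has_no_letters]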

CodePudding user response:

I would suggest casting your columns to the desired format (e.g. numeric, datetime) and removing the rows with problems:

v0 = df.CPRNUM.astype(str).str.len() == 10
v1 = pd.to_numeric(df.CPRNUM, errors="coerce").notna()
v2 = pd.to_datetime(df.REQ_SAMPLETIME, errors="coerce").notna()
filtered = df[v0 * v1 * v2]
print(filtered)

Explanation:

  • pd.to_<type> casts the input to the given type
  • errors="coerce" turns the problematic elements to NAs
  • .notna() gives a boolean vector with true where the value is not NA
  • * multiplying the boolean vectors gives the element-wise AND of the conditions (equivalent to combining them with the & operator, as sketched below)
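
For completeness, the same combination can be written with the bitwise & operator, which is the more common pandas idiom for combining boolean masks; reusing v0, v1 and v2 from the snippet above, it gives the same result:

filtered = df[v0 & v1 & v2]
print(filtered)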

Data

import pandas as pd

# Sample data
records = [
    dict(CPRNUM="1234567890", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
    dict(CPRNUM="The sample was..", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
    dict(CPRNUM="0987654321", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
    dict(CPRNUM="123", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907)
]
df = pd.DataFrame.from_records(records)
print(df)
#              CPRNUM    REQ_SAMPLETIME  SAMPLE_ID    RESULT
# 0        1234567890  2014-05-30 07:59   50226686  0.740909
# 1  The sample was..  2014-05-30 07:59   50226686  0.740909
# 2        0987654321  The sample was..   50226686  0.740909
# 3               123  The sample was..   50226686  0.740909
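
As a quick sanity check, applying the masks from the snippet above to this sample frame should keep only row 0, since it is the only row with a 10-digit numeric CPRNUM and a parseable REQ_SAMPLETIME (the masks are repeated here so the block is self-contained; column spacing in the printed output may differ slightly):

v0 = df.CPRNUM.astype(str).str.len() == 10
v1 = pd.to_numeric(df.CPRNUM, errors="coerce").notna()
v2 = pd.to_datetime(df.REQ_SAMPLETIME, errors="coerce").notna()
print(df[v0 * v1 * v2])
#        CPRNUM    REQ_SAMPLETIME  SAMPLE_ID    RESULT
# 0  1234567890  2014-05-30 07:59   50226686  0.740909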