i'm trying to do some date cleaning on some huge datasets, and I'm new to python (I've used google to search for my problem), so please bear over me with my terminology.
The data is imported from a CSV into a pandas.core.frame.DataFrame Some of my columns should only contain numbers and others only text:
CPRNUM REQ_SAMPLETIME SAMPLE_ID RESULT
0 1234567890 2014-05-30 07:59 50226686 0.7409090909090907
1 The sample was.. 2013-09-04 07:45 47721186 0.8290909090909093
2 1234567890 The sample was.. 46473016 1.0918181818181818
I would really like to get rid of the rows within column REQ_SAMPLETIME and CPRNUM which is not 10 digits long, and contain text, so it would look like:
CPRNUM REQ_SAMPLETIME SAMPLE_ID RESULT
0 1234567890 2014-05-30 07:59 50226686 0.7409090909090907
3 0987654321 2018-06-10 05:32 12354678 3.7290909090909093
4 1234567890 2013-09-04 07:45 15672687 5.9999951818181818
Thanks for the help
CodePudding user response:
Another way to solve this is as below
import pandas as pd
import re
df = pd.read_csv('data.csv')
#This function checks if the number is valid
def isCPRNUMvalid(val):
if len(val) == 10: #if string has length of 10
if val.isnumeric(): #if string is pure number
return True
return False
#This functoin checks if datetime is valid
#You can modify this to get date-time from string if present
def isREQSAMLETIMEvalid(st):
r = r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}' #regex to match
if re.match(r, st):
return True
return False
df['validCRPNUM'] = df['CPRNUM'].apply(isCPRNUMvalid)
df['validREQ_SAMPLETIME'] = df['REQ_SAMPLETIME'].apply(isREQSAMLETIMEvalid)
df = df[df.validCRPNUM] #if CRPNUM is TRUE
df = df[df.validREQ_SAMPLETIME] #if REQ_SAMPLETIME is TRUE
print(df)
CodePudding user response:
Assuming your dataframe is stored in df variable, you can try -
df = df[(df["CPRNUM"].str.isdigit()) & (df["CPRNUM"].str.len() == 10) & (~df["REQ_SAMPLETIME"].str.contains("[A-Za-z]"))]
CodePudding user response:
I would suggest casting your columns to the desired format (e.g. numeric;datetime
) and removing rows with problems:
v0 = df.CPRNUM.astype(str).str.len() == 10
v1 = pd.to_numeric(df.CPRNUM, errors="coerce").notna()
v2 = pd.to_datetime(df.REQ_SAMPLETIME, errors="coerce").notna()
filtered = df[v0 * v1 * v2]
print(filtered)
Explanation:
pd.to_<type>
casts the input to the given typeerrors="coerce"
turns the problematic elements to NAs.notna()
gives a boolean vector with true where the value is not NA*
multiplying the boolean vectors gives us the vector corresponding to theand
operator
Data
import pandas as pd
# Sample data
records = [
dict(CPRNUM="1234567890", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
dict(CPRNUM="The sample was..", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
dict(CPRNUM="0987654321", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
dict(CPRNUM="123", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907)
]
df = pd.DataFrame.from_records(records)
print(df)
# CPRNUM REQ_SAMPLETIME SAMPLE_ID RESULT
# 0 1234567890 2014-05-30 07:59 50226686 0.740909
# 1 The sample was.. 2014-05-30 07:59 50226686 0.740909
# 2 0987654321 The sample was.. 50226686 0.740909
# 3 123 The sample was.. 50226686 0.740909