I'm working with a pandas dataframe similar to the sample data below.
I need to be able to create a new column, year, by looking at the data in a timestamp field.
However, the timestamp field is partly corrupted: sometimes the year is invalid (see the Spa record), and sometimes two entries were entered into the same field (see Popeyes).
As a starting point, I wrote a function that flags which values may not contain valid dates. I then try to use that flag to decide which values to take the year substring from for the new column.
# Import pandas library
import pandas as pd
# initialize list of lists
data = [
    ['Popeyes', '2021/09/21 : 8:30 PM; 2022/10/21 : 6:30 PM'],
    ['Apple Store', '2021/09/21 : 10:00 AM'],
    ['White Castle', '2022/10/23 : 7:00 AM'],
    ['Spa', '202233/10/25 : 7:00 AM'],
    ['Gas', '2022/10/26 : 1:00 PM'],
    ['Target', '202299/10/27 : 4:00 PM'],
    ['Movie Theater', '2022/10/26 : 1:00 PM'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Transaction', 'Swipe timestamp'])
# print dataframe.
df
from dateutil.parser import parse
def is_date(string, fuzzy=False):
    """
    Return whether the string can be interpreted as a date.
    :param string: str, string to check for date
    :param fuzzy: bool, ignore unknown tokens in string if True
    """
    try:
        parse(string, fuzzy=fuzzy)
        return True
    except ValueError:
        return False
df["is_date_check"] = df["Swipe timestamp"].apply(is_date, fuzzy=True)
df
def extract_year(row):
    if row['is_date_check'] == True:
        year = df["Swipe timestamp"].str[:4]
    else:
        year = ''
    return year
df['year'] = df.apply(lambda row: extract_year(row), axis=1)
df
CodePudding user response:
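You need to change df to row inside your extract_year function. With axis=1, apply passes one row at a time, so row["Swipe timestamp"] is a single string, whereas df["Swipe timestamp"].str[:4] returns the whole column on every call.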
def extract_year(row):
    if row['is_date_check'] == True:
        year = row["Swipe timestamp"][:4]  # <--- here
    else:
        year = ''
    return year
df['year'] = df.apply(extract_year, axis=1)
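To see the difference between the two expressions, here is a minimal check on a hypothetical two-row frame (not the full sample data). It only compares what each expression returns; it does not reproduce the original output.

```python
import pandas as pd

# Hypothetical two-row frame, used only to compare the two expressions.
df = pd.DataFrame(
    {"Swipe timestamp": ["2021/09/21 : 10:00 AM", "2022/10/23 : 7:00 AM"]}
)

# Indexing the DataFrame gives a Series with one value per row.
whole_column = df["Swipe timestamp"].str[:4]

# apply(axis=1) hands the function one row at a time; df.iloc[0] stands in for it.
first_row = df.iloc[0]
single_value = first_row["Swipe timestamp"][:4]  # plain string slicing

print(type(whole_column).__name__, len(whole_column))
print(type(single_value).__name__, single_value)
```

So inside the applied function, df[...] always refers to every row, while row[...] refers to the one currently being processed.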
Or with np.where, which works on the whole column at once:
import numpy as np
df['year'] = np.where(df['is_date_check'], df['Swipe timestamp'].str[:4], '')
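If the date parser is not strictly needed, another option is to pull the year out with a regular expression. This is a different technique from the two above, sketched under the assumption that a usable entry always starts with exactly four digits followed by a slash. For fields holding two entries, such as Popeyes, it takes the year of the first entry only.

```python
import pandas as pd

data = [
    ['Popeyes', '2021/09/21 : 8:30 PM; 2022/10/21 : 6:30 PM'],
    ['Apple Store', '2021/09/21 : 10:00 AM'],
    ['White Castle', '2022/10/23 : 7:00 AM'],
    ['Spa', '202233/10/25 : 7:00 AM'],
    ['Gas', '2022/10/26 : 1:00 PM'],
    ['Target', '202299/10/27 : 4:00 PM'],
    ['Movie Theater', '2022/10/26 : 1:00 PM'],
]
df = pd.DataFrame(data, columns=['Transaction', 'Swipe timestamp'])

# Assumption: a valid year is exactly four digits at the start, followed by "/".
# Rows that do not match (e.g. '202233/...') come back as NaN.
year = df['Swipe timestamp'].str.extract(r'^(\d{4})/', expand=False)

# Use an empty string for the rows without a usable year, as in the answer above.
df['year'] = year.fillna('')
print(df[['Transaction', 'year']])
```

This skips the is_date_check column entirely, so whether it fits depends on how strict the validation needs to be: it checks only the shape of the year, not the month, day, or time.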