Extract year from corrupted timestamp column

I'm working with a pandas dataframe similar to the sample data below.

I need to be able to create a new column, year, by looking at the data in a timestamp field.

However, the timestamp field is partly corrupted. Sometimes the year is invalid (see the Spa and Target records), and sometimes two entries were entered into the same field (see Popeyes).

As a starting point, I used a function to identify which values may not contain valid dates. I then tried to use the result of that check to decide which values to take the year substring from for the new column.

# Import pandas library
import pandas as pd
  
# initialize list of lists
data = [
    ['Popeyes', '2021/09/21 : 8:30 PM; 2022/10/21 : 6:30 PM'],
    ['Apple Store', '2021/09/21 : 10:00 AM'],
    ['White Castle', '2022/10/23 : 7:00 AM'],
    ['Spa', '202233/10/25 : 7:00 AM'],
    ['Gas', '2022/10/26 : 1:00 PM'],
    ['Target', '202299/10/27 : 4:00 PM'],
    ['Movie Theater', '2022/10/26 : 1:00 PM'],
]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Transaction', 'Swipe timestamp'])
  
# print dataframe.
df

from dateutil.parser import parse

def is_date(string, fuzzy=False):
    """
    Return whether the string can be interpreted as a date.

    :param string: str, string to check for date
    :param fuzzy: bool, ignore unknown tokens in string if True
    """
    try: 
        parse(string, fuzzy=fuzzy)
        return True

    except ValueError:
        return False
    
df["is_date_check"] = df["Swipe timestamp"].apply(is_date, fuzzy=True)

df

def extract_year(row):
    if row['is_date_check'] ==True:
        year = df["Swipe timestamp"].str[:4] 
    else:
        year=''
    return year

df['year'] = df.apply (lambda row: extract_year(row), axis=1)

df

CodePudding user response:

You need to change df to row inside extract_year, so that the function slices the current row's value instead of the whole column:

def extract_year(row):
    if row['is_date_check'] ==True:
        year = row["Swipe timestamp"][:4] # <--- here
    else:
        year=''
    return year

df['year'] = df.apply(extract_year, axis=1)
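The difference between the two versions can be seen by comparing what each expression returns. The snippet below is a minimal sketch on a two-row stand-in for the question's data (the names `whole_column` and `single_value` are only illustrative): indexing `df` gives a whole Series for every call, while indexing `row` gives one string.

```python
import pandas as pd

# Two-row stand-in for the question's data (values copied from the sample).
df = pd.DataFrame({
    "Transaction": ["Apple Store", "Gas"],
    "Swipe timestamp": ["2021/09/21 : 10:00 AM", "2022/10/26 : 1:00 PM"],
})

# What apply(..., axis=1) passes to the function: one row as a Series.
row = df.iloc[0]

# Indexing the DataFrame slices every row at once and yields a Series,
# so the original function returned the full column for each row.
whole_column = df["Swipe timestamp"].str[:4]

# Indexing the row yields a plain string, so ordinary slicing applies.
single_value = row["Swipe timestamp"][:4]

print(type(whole_column).__name__)  # Series
print(single_value)                 # 2021
```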

Or with np.where, which avoids the row-wise apply:

import numpy as np

df['year'] = np.where(df['is_date_check'], df['Swipe timestamp'].str[:4], '')
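If the date check itself turns out to be unreliable for this data, a different technique is to validate the year with a regular expression instead. This is only a sketch, and it rests on an assumption that is not in the question: a year counts as valid when exactly four digits sit directly before the first "/" at the start of the string. For records with two entries, such as Popeyes, it takes the year of the first entry.

```python
import pandas as pd

data = [
    ["Popeyes", "2021/09/21 : 8:30 PM; 2022/10/21 : 6:30 PM"],
    ["Apple Store", "2021/09/21 : 10:00 AM"],
    ["White Castle", "2022/10/23 : 7:00 AM"],
    ["Spa", "202233/10/25 : 7:00 AM"],
    ["Gas", "2022/10/26 : 1:00 PM"],
    ["Target", "202299/10/27 : 4:00 PM"],
    ["Movie Theater", "2022/10/26 : 1:00 PM"],
]
df = pd.DataFrame(data, columns=["Transaction", "Swipe timestamp"])

# Assumption: a valid year is exactly four digits followed by "/" at the
# start of the string. "202233/..." fails because a digit, not "/",
# follows the first four digits.
year = df["Swipe timestamp"].str.extract(r"^(\d{4})/", expand=False)

# Non-matching rows come back as NaN; use '' to mirror the answers above.
df["year"] = year.fillna("")

print(df[["Transaction", "year"]])
```

With this pattern the Spa and Target rows end up with an empty year, and no separate is_date_check column is needed.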