Home > Back-end >  How to flatten a column containing a list of dates as strings into just date?
How to flatten a column containing a list of dates as strings into just date?

Time:12-23

I have a dataframe containing a column that has a list of dates stored as strings:

# sample dataframe
data = [[1, ["2019-08-02 08:30:56"]], [2, ["2020-08-02 08:30:56"]]]

df = pd.DataFrame(data, columns=["items", "dates"])
df["dates"] = df["dates"].astype(str)
df
  items                  dates
0   1   ['2019-08-02 08:30:56']
1   2   ['2020-08-02 08:30:56']

I would like to do several things:

  • Convert from a list to a string.
  • Convert the string to a date.
  • Eliminate the time stamp.

So that the final dataset would look like this:

    items       dates
0   1      2019-08-02
1   2      2020-08-02

I am able to remove the list brackets by doing this:

df["dates_2"] = df["dates"].apply(lambda x: x[1:-1])

But I am wondering if there is a better way to do all of these things in one step?

CodePudding user response:

Another solution is to apply ast.literal_eval to convert the column to list. Then .explode() it, convert to datetime and get a date part:

from ast import literal_eval

df["dates"] = df["dates"].apply(literal_eval)
df = df.explode("dates")
df["dates"] = pd.to_datetime(df["dates"]).dt.date

print(df)

Prints:

   items       dates
0      1  2019-08-02
1      2  2020-08-02

CodePudding user response:

Example

data = [[1, ["2019-08-02 08:30:56"]], [2, ["2020-08-02 08:30:56"]]]    
df = pd.DataFrame(data, columns=["items", "dates"]) 
df["dates"] = df["dates"].astype(str)

Code

pd.to_datetime(df['dates'].str[2:-2]).dt.normalize()

output

0   2019-08-02
1   2020-08-02
Name: dates, dtype: datetime64[ns] <-- chk dtype



if you want dtype is object not datetime, use following code:

df['dates'].str[2:12]

0    2019-08-02
1    2020-08-02
Name: dates, dtype: object <-- chk dtype

CodePudding user response:

You can either use pandas.Series.strip with pandas.Series.str :

df["dates"] = df["dates"].astype(str).str.strip("['']").str[:10]

Or use pandas.to_datetime with pandas.Series.dt.strftime :

df["dates"] = pd.to_datetime(df["dates"].astype(str).str.strip("['']")).dt.strftime("%Y-%m-%d")

# Output

print(df)

   items       dates
0      1  2019-08-02
1      2  2020-08-02

CodePudding user response:

This is how I would do it:

df.dates = pd.to_datetime(df.dates.str.slice(1,-1)).dt.date

Output:

   items       dates
0      1  2019-08-02
1      2  2020-08-02

It's important to note that dates contains values of type str, each of which looks like a list when displayed but is not a list (each is a str) after running the following code as shown in the original question:

# sample dataframe
data = [[1, ["2019-08-02 08:30:56"]], [2, ["2020-08-02 08:30:56"]]]

df = pd.DataFrame(data, columns=["items", "dates"])
df["dates"] = df["dates"].astype(str)

Therefore, all we need to do is:

  • remove the [ and ] characters using df.dates.str.slice(1,-1) to get the substring of type str that looks like a datetime
  • use pd.to_datetime() to convert the Series to Timestamps
  • use .dt to get at the datetime-like properties of the values in the Series
  • use the .date accessor to convert to a numpy array of objects of type datetime.date
  • Related