Is it possible to write a lambda function for this?

I have a dataset with a column "released" in the format "May 11, 2001 (Canada)". I want to split it into separate columns: released_date, released_year and released_month (plus released_country). I have done it as shown below, but I was wondering whether, and how, this could be written as a lambda function.

released_date = []
released_country = []

for x in movies['released']:
    date = x.split("(")[0]
    country = x.split("(")[1].replace(')','')
    released_date.append(date)
    released_country.append(country)

movies['released_country'] = released_country
movies['released_date'] = released_date
movies['released_date'] = pd.to_datetime(movies['released_date'])
movies['released_year'] = movies['released_date'].dt.year
movies['released_month'] = movies['released_date'].dt.month

CodePudding user response：

Use the following:

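df['date'] = df['a'].str.split('(').str[0].str.strip()
# regex=False treats ')' as a literal character; as a regex pattern it would be invalid
df['released_country'] = df['a'].str.split('(').str[1].str.replace(')', '', regex=False)
# %b matches abbreviated month names; use %B if the months are spelled out in full
df['date_p'] = pd.to_datetime(df['date'], format='%b %d, %Y')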
df['released_date'] = df['date_p'].dt.date
df['released_month'] = df['date_p'].dt.month
df['released_year'] = df['date_p'].dt.year

This assumes the original column is named a.

CodePudding user response：

You can write a helper function like the one below and, if you like, call it through a lambda:

>>> import re
>>> data = "May 11, 2001 (Canada)"
>>> def my_func(data):
...     data = re.sub("[(),]", "", data).split(' ')
...     return data[0], data[1], data[2]
...
>>> my_func(data)
('May', '11', '2001')
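
Calling the helper over the whole column could look roughly like the sketch below. It assumes a DataFrame named movies with a released column, as in the question; the two sample rows and the column names month_name, day and year are made up for illustration.

```python
import re

import pandas as pd


def my_func(data):
    # Drop parentheses and commas, then split on spaces.
    parts = re.sub("[(),]", "", data).split(' ')
    return parts[0], parts[1], parts[2]


# Hypothetical sample data in the question's format.
movies = pd.DataFrame({"released": ["May 11, 2001 (Canada)", "July 4, 1996 (France)"]})

# apply calls the function once per entry. The lambda adds nothing here:
# movies["released"].apply(my_func) would behave the same.
parts = movies["released"].apply(lambda s: my_func(s))

# Each result is a tuple, so expand the tuples into separate columns.
split_cols = pd.DataFrame(parts.tolist(), index=movies.index,
                          columns=["month_name", "day", "year"])
```

Note that the values are still strings at this point; they would need pd.to_datetime or int conversion before any date arithmetic.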

CodePudding user response：

Clear your head of the idea that lambda is any kind of special tool in itself, or that lambda is what's doing the work when you see others' code doing cool things with third-party libraries like Pandas.

All a lambda is, is a convenient way to write a short, simple function without having to give it a name, so that you can put it inline with other code.

In exchange for that, you are very limited: instead of writing a normal function body, you write a single expression (the result of which is returned). That isn't practical in your case.

The neat thing that Pandas does, generally, is repeat code that works on a single cell, across an entire row, column or the entire DataFrame. Being able to do that sort of thing is why you use Pandas.

The Pandas tool that we want here is the apply method of the movies['released'] Series (i.e., column of the DataFrame). That lets us use a function that handles a single entry from that Series, and apply it to the entire thing.

First, we write a normal function that processes a single release-date entry, and gives us a Series of the values we want:

def parse_release_date(x):
    date = pd.to_datetime(x.split("(")[0])
    country = x.split("(")[1].replace(')','')
    return pd.Series((country, date), ('released_country', 'released_date'))

(It is possible to write that as a lambda, but it makes things look a lot messier than they need to. Giving the function a name, here, also makes the code easier to understand.)
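
For comparison, here is a sketch of what the lambda version of parse_release_date might look like. The sample DataFrame is made up, and a .strip() has been added to the date text as a precaution; otherwise the logic is the same, just squeezed into one expression.

```python
import pandas as pd

# Hypothetical sample data in the question's format.
movies = pd.DataFrame({"released": ["May 11, 2001 (Canada)", "July 4, 1996 (France)"]})

# Everything has to fit in a single expression, so the string is split twice
# and the two results are packed straight into a Series.
release_dates = movies["released"].apply(
    lambda x: pd.Series(
        (x.split("(")[1].replace(")", ""), pd.to_datetime(x.split("(")[0].strip())),
        ("released_country", "released_date"),
    )
)
```

It works, but there are no intermediate names to tell the reader what each piece is, which is the messiness referred to above.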

As explained in the documentation I linked, now we can apply that to our Series, and we get a DataFrame: each call to the function makes a row of values.

release_dates = movies['released'].apply(parse_release_date)

From there, we can simply insert columns back into movies in the normal way:

movies['released_country'] = release_dates['released_country']
movies['released_date'] = release_dates['released_date']
movies['released_year'] = release_dates['released_date'].dt.year
movies['released_month'] = release_dates['released_date'].dt.month

Alternately, you can rely purely on the fundamental operations that Pandas provides, as shown in @Vivek Kalyanarangan's answer. This is the same kind of thing that you do with .dt.year and .dt.month, but applied to the entire problem. .str works like .dt (but for strings instead of datetime values), and it offers replace, split and strip methods that work like the corresponding string methods, applied to every string in a Series. The result of each call is still a Series, so [0] gives you one of the entries instead of the first element of each entry; for that, you need .str[0] as shown.
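
The difference between [0] and .str[0] is easy to see on a tiny example. This is only a sketch: the Series below is made-up sample data with a default integer index.

```python
import pandas as pd

# Hypothetical sample data in the question's format.
released = pd.Series(["May 11, 2001 (Canada)", "July 4, 1996 (France)"])

# Each entry becomes a list of two strings: the date text and the country.
pieces = released.str.split("(")

# Plain indexing picks one entry of the Series (here, the list for row 0).
first_entry = pieces[0]

# .str[0] picks the first piece of every entry instead.
date_text = pieces.str[0].str.strip()

# regex=False makes ')' a literal character rather than a regex pattern.
country = pieces.str[1].str.replace(")", "", regex=False)
```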

CodePudding user response：

released -> released_date, released_year, released_month, (released_country)

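import re
released = "May 11, 2001 (Canada)"  # one entry from the column
# month name, day, 4-digit year, then the country inside parentheses
pattern = r"\s*([a-zA-Z]+)\s*([0-9]{1,2})\s*,\s*([0-9]{4})\s*\(\s*([a-zA-Z ]+?)\s*\)\s*"
released_month, released_date, released_year, released_country = re.search(pattern, released).groups()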

I think I know why you want a lambda, but I couldn't find a need for one here.

Lambda is frequently used in the following situations.

When a function or method takes a function as an argument, but you can't be bothered to define that function just to pass it as an argument. An example would be

a.sort(key=lambda x: x[3])

When a function is used just once and you can't be bothered to define it. An example would be

new_list = list(map(lambda x: x**2, old_list))

So anyway, my code will parse the "May 11, 2001 (Canada)" format into 4 strings:

May
11
2001
Canada

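and then assign them to the released_month, released_date, released_year and released_country variables, in that order (released_date here is just the day of the month).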

You can then apply pandas functions to turn those strings into actual datetime values.

To see how the regex works, go to https://regexr.com/ and paste my "pattern" there.
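
Turning the parsed strings into real dates for a whole column might look something like the sketch below. It uses Series.str.extract with named groups rather than re.search, which is a swapped-in technique and not part of the answer above. The sample DataFrame and the group names are made up, and the %B format assumes full English month names.

```python
import pandas as pd

# Hypothetical sample data in the question's format.
movies = pd.DataFrame(
    {"released": ["May 11, 2001 (Canada)", "July 4, 1996 (United States)"]}
)

# Named groups become the column names of the extracted DataFrame.
pattern = (
    r"(?P<month>[a-zA-Z]+)\s+(?P<day>[0-9]{1,2})\s*,\s*(?P<year>[0-9]{4})"
    r"\s*\(\s*(?P<country>[^)]+?)\s*\)"
)
parts = movies["released"].str.extract(pattern)

# Rebuild a date string and parse it; %B expects a full month name.
date_text = parts["month"] + " " + parts["day"] + " " + parts["year"]
movies["released_date"] = pd.to_datetime(date_text, format="%B %d %Y")
movies["released_year"] = movies["released_date"].dt.year
movies["released_month"] = movies["released_date"].dt.month
movies["released_country"] = parts["country"]
```

Rows that do not match the pattern come back as missing values from str.extract, so it is worth checking for NaN before parsing real data.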