I have a dataset with a column "released" in the format "May 11, 2001 (Canada)". I want to break it into 3 columns: released_date, released_year and released_month. I have done it as below, but I was wondering if and how to write this as a lambda function.
released_date = []
released_country = []
released_year = []
for x in movies['released']:
    date = x.split("(")[0]
    country = x.split("(")[1].replace(')','')
    released_date.append(date)
    released_country.append(country)
movies['released_country'] = released_country
movies['released_date'] = released_date
movies['released_date'] = pd.to_datetime(movies['released_date'])
movies['released_year'] = movies['released_date'].dt.year
movies['released_month'] = movies['released_date'].dt.month
CodePudding user response:
Use -
df['date'] = df['a'].str.split('(').str[0].str.strip()
df['released_country'] = df['a'].str.split('(').str[1].str.replace(')', '', regex=False)
df['date_p'] = pd.to_datetime(df['date'], format='%b %d, %Y')
df['released_date'] = df['date_p'].dt.date
df['released_month'] = df['date_p'].dt.month
df['released_year'] = df['date_p'].dt.year
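This assumes the original column is named a. If the month names are spelled out in full (e.g. "June"), use %B instead of %b in the format string.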
CodePudding user response:
You can write a helper function like the one below and, if you like, call it from a lambda:
>>> import re
>>> data = "May 11, 2001 (Canada)"
>>> def my_func(data):
...     data = re.sub("[(),]", "", data).split(' ')
...     return data[0], data[1], data[2]
...
>>> my_func(data)
('May', '11', '2001')
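As a rough sketch of the "call it with a lambda" part, the helper can be passed to the apply method of a pandas column. The one-row movies frame and the released_day column name below are illustrative assumptions, not taken from the question.

```python
import re
import pandas as pd

def my_func(data):
    # Drop "(", ")" and ",", then split on spaces:
    # "May 11, 2001 (Canada)" -> ['May', '11', '2001', 'Canada']
    data = re.sub("[(),]", "", data).split(" ")
    return data[0], data[1], data[2]

# Illustrative frame standing in for the question's movies DataFrame.
movies = pd.DataFrame({"released": ["May 11, 2001 (Canada)"]})

# Each lambda picks one field out of the tuple returned by the helper.
movies["released_month"] = movies["released"].apply(lambda s: my_func(s)[0])
movies["released_day"] = movies["released"].apply(lambda s: my_func(s)[1])
movies["released_year"] = movies["released"].apply(lambda s: int(my_func(s)[2]))
```

Note that this calls the helper three times per row; it is meant to show where the lambda goes, not to be the fastest approach.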
CodePudding user response:
Clear your head of the idea that lambda is any kind of special tool in itself, or that lambda is what's doing the work when you see others' code doing cool things with third-party libraries like Pandas.
All lambda is, is a convenient way to write a short, simple function without having to give it a name, and to put it inline with other code.
In exchange for that, you are very limited: instead of writing a normal function body, you write a single expression (the result of which is returned). That isn't practical in your case.
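To make that concrete, here is a minimal sketch (the names get_country and get_country_lambda and the sample string are illustrative): a lambda is interchangeable with a def whose body is a single return statement.

```python
# A named function whose body is a single return statement...
def get_country(text):
    return text.split("(")[1].replace(")", "")

# ...and the equivalent lambda: one expression, implicitly returned.
get_country_lambda = lambda text: text.split("(")[1].replace(")", "")

sample = "May 11, 2001 (Canada)"  # illustrative value
print(get_country(sample))         # Canada
print(get_country_lambda(sample))  # Canada
```

Anything that needs several statements, such as parsing the date and the country separately and then packaging both results, no longer fits comfortably into that single expression.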
The neat thing that Pandas does, generally, is repeat code that works on a single cell, across an entire row, column or the entire DataFrame. Being able to do that sort of thing is why you use Pandas.
The Pandas tool that we want here is the apply method of the movies['released'] Series (i.e., column of the DataFrame). That lets us use a function that handles a single entry from that Series, and apply it to the entire thing.
First, we write a normal function that processes a single release-date entry, and gives us a Series of the values we want:
def parse_release_date(x):
    date = pd.to_datetime(x.split("(")[0])
    country = x.split("(")[1].replace(')','')
    return pd.Series((country, date), ('released_country', 'released_date'))
(It is possible to write that as a lambda, but it makes things look a lot messier than they need to. Giving the function a name, here, also makes the code easier to understand.)
As explained in the documentation for apply, we can now apply that function to our Series, and we get a DataFrame: each call to the function makes a row of values.
release_dates = movies['released'].apply(parse_release_date)
From there, we can simply insert columns back into movies in the normal way:
movies['released_country'] = release_dates['released_country']
movies['released_date'] = release_dates['released_date']
movies['released_year'] = release_dates['released_date'].dt.year
movies['released_month'] = release_dates['released_date'].dt.month
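Putting those pieces together, a self-contained version might look like the sketch below. The two sample rows are made up for illustration, and the extra pd.to_datetime call on the column is a defensive assumption so that .dt works whatever dtype apply inferred.

```python
import pandas as pd

# Illustrative data; the real movies DataFrame comes from the question.
movies = pd.DataFrame(
    {"released": ["May 11, 2001 (Canada)", "June 3, 1994 (France)"]}
)

def parse_release_date(x):
    # Split "May 11, 2001 (Canada)" into its date and country parts.
    date_part, country_part = x.split("(")
    return pd.Series({
        "released_country": country_part.replace(")", "").strip(),
        "released_date": pd.to_datetime(date_part.strip()),
    })

# One call per entry; the returned Series become the rows of a DataFrame.
release_dates = movies["released"].apply(parse_release_date)

movies["released_country"] = release_dates["released_country"]
# Harmless if the column is already datetime64; guarantees .dt is available.
movies["released_date"] = pd.to_datetime(release_dates["released_date"])
movies["released_year"] = movies["released_date"].dt.year
movies["released_month"] = movies["released_date"].dt.month
print(movies)
```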
Alternatively, you can use purely the fundamental operations provided by Pandas, as shown in @Vivek Kalyanarangan's answer: the same kind of thing that you do with .dt.year and .dt.month, but to solve the entire problem. .str works like .dt (but you get strings instead of Datetime objects), and it offers replace, split and strip methods that work like the corresponding string methods (just applying them to every string in a Series). The result is still a Series, so [0] gives you one of the entries, instead of giving you the first element (or character) of each entry - for that, you need .str[0] as shown.
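The difference between [0] and .str[0] is easiest to see on a small Series. The following is a sketch with made-up rows; released stands in for movies['released'].

```python
import pandas as pd

# Illustrative column standing in for movies['released'].
released = pd.Series(["May 11, 2001 (Canada)", "June 3, 1994 (France)"])

parts = released.str.split("(")   # Series of lists, one list per row

first_row = parts[0]              # plain indexing: the entry labelled 0
first_piece = parts.str[0]        # .str indexing: element 0 of every list

date_text = first_piece.str.strip()
countries = parts.str[1].str.replace(")", "", regex=False)

print(first_row)
print(date_text.tolist())
print(countries.tolist())
```

Passing regex=False makes replace treat ")" as a literal character rather than as a regular expression.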
CodePudding user response:
released -> released_date, released_year, released_month, (released_country)
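import re
released = "May 11, 2001 (Canada)"  # one value from the "released" column
# Groups: month name, day of month, four-digit year, country
pattern = r"\s*([a-zA-Z]+)\s*([0-9]{1,2})\s*,\s*([0-9]{4})\s*\((\s*[a-zA-Z]+\s*)\)\s*"
released_month, released_date, released_year, released_country = re.search(pattern, released).groups()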
I think I know why you want a lambda, but I couldn't find a need for it here.
Lambdas are frequently used in the following situations.
When a function or method takes a function as an argument, but you can't be bothered to define that function just to pass it as an argument. An example would be
a.sort(key=lambda x: x[3])
When a function is used just once and you can't be bothered to define it. An example would be
new_list = list(map(lambda x: x**2, old_list))
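Both one-liners can be run as is once a and old_list exist. The sketch below uses made-up data, and the named functions fourth_field and square are hypothetical equivalents added only for comparison.

```python
# Illustrative tuples; only the fourth field matters for the sort.
a = [("c", 0, 0, 3), ("a", 0, 0, 1), ("b", 0, 0, 2)]
a.sort(key=lambda x: x[3])          # sort in place by the fourth field

old_list = [1, 2, 3]
new_list = list(map(lambda x: x**2, old_list))

# The same two calls written with named functions instead of lambdas.
def fourth_field(x):
    return x[3]

def square(x):
    return x**2

assert sorted(a, key=fourth_field) == a
assert list(map(square, old_list)) == new_list
```

The lambda versions save two definitions; the named versions are easier to reuse and to read in a traceback.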
So anyway, my code will parse the "May 11, 2001 (Canada)" format into 4 strings:
- May
- 11
- 2001
- Canada
and then assign them to the released_month, released_date, released_year and released_country variables, in that order.
You can then apply pandas functions to turn those strings into proper datetime values.
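One possible way to do that for a whole column is Series.str.extract followed by pd.to_datetime. This is a sketch under assumptions: the pattern is written with + quantifiers on the letter groups, the sample rows are made up, and the intermediate column names month_name and day are hypothetical.

```python
import pandas as pd

# Assumed form of the pattern, with "+" quantifiers on the letter groups.
pattern = r"\s*([a-zA-Z]+)\s*([0-9]{1,2})\s*,\s*([0-9]{4})\s*\((\s*[a-zA-Z]+\s*)\)\s*"

# Illustrative rows standing in for movies['released'].
released = pd.Series(["May 11, 2001 (Canada)", "June 3, 1994 (France)"])

# One column per capture group: month name, day, year, country.
parts = released.str.extract(pattern)
parts.columns = ["month_name", "day", "year", "released_country"]

# Reassemble the pieces and parse them with an explicit format
# (%B = full month name, %d = day of month, %Y = four-digit year).
released_date = pd.to_datetime(
    parts["month_name"] + " " + parts["day"] + " " + parts["year"],
    format="%B %d %Y",
)
released_year = released_date.dt.year
released_month = released_date.dt.month
```

As written, the pattern only matches single-word country names; a value such as "United States" would not match and would come back as missing values.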
To see how the regex works, go to https://regexr.com/ and copy and paste my "pattern" there.