I would like to create a new column called "season_new" that keeps the season where it is not null and otherwise extracts the season from the programme name. My dataframe looks something like this:
programme | season |
---|---|
grey's anatomy s1 | null |
friends season 1 | 1 |
grey's anatomy s2 | null |
big bang theory s2 | 2 |
big bang theory | 1 |
peaky blinders | 1 |
I tried using a regex:
dt['season_new'] = dt['programme'].str.extract(r'(season\s?\d+|s\s?\d+)')
But it gave me this result:
programme | season | season_new |
---|---|---|
grey's anatomy s1 | null | 1 |
friends season 1 | 1 | 1 |
grey's anatomy s2 | null | 2 |
big bang theory s2 | 2 | 2 |
big bang theory | 1 | null |
peaky blinders | 1 | null |
The result that I expected is:
programme | season | season_new |
---|---|---|
grey's anatomy s1 | null | 1 |
friends season 1 | 1 | 1 |
grey's anatomy s2 | null | 2 |
big bang theory s2 | 2 | 2 |
big bang theory | 1 | 1 |
peaky blinders | 1 | 1 |
CodePudding user response:
When trying your code, for some reason the regex didn't return only the integers:
0 grey's anatomy s1 NaN s1
1 friends season 1 1.0 season 1
2 grey's anatomy s2 NaN s2
3 big bang theory s2 2.0 s2
4 big bang theory 1.0 NaN
5 peaky blinders 1.0 NaN
I am not great with regex, so I looked into another option, shown below.
import pandas as pd

# source_file is the path to the Excel file holding the data
df = pd.read_excel(source_file)
# Empty list for data capture
season_data = []
# Loop through all rows
for idx in df.index:
    # Grab value to check
    check_val = df.loc[idx, "season"]
    # If value is not null then keep it
    if pd.notnull(check_val):
        # Add value to list
        season_data.append(int(check_val))
    else:
        # Extract digits from programme description
        extract_result = "".join(i for i in df.loc[idx, "programme"] if i.isdigit())
        # Add value to list as an integer, or None if the name has no digits
        season_data.append(int(extract_result) if extract_result else None)
# Add full list to dataframe
df["season_new"] = season_data
print(df)
Result is:
programme season season_new
0 grey's anatomy s1 NaN 1
1 friends season 1 1.0 1
2 grey's anatomy s2 NaN 2
3 big bang theory s2 2.0 2
4 big bang theory 1.0 1
5 peaky blinders 1.0 1
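One caveat with joining every digit in the title: a programme whose name itself contains numbers would give the wrong season. As a rough sketch only (the helper names `season_from_title` and `all_digits` and the title "24 s3" are made up for illustration, and it assumes the season is written as "s<N>" or "season <N>" at the very end of the title), a small regex-based helper could look like this:

```python
import re

# Hypothetical pattern: "s<N>" or "season <N>" at the end of the title.
SEASON_RE = re.compile(r"\b(?:season|s)\s?(\d+)$", re.IGNORECASE)


def season_from_title(title):
    """Return the season number found at the end of the title, or None."""
    match = SEASON_RE.search(title.strip())
    return int(match.group(1)) if match else None


def all_digits(title):
    """The digit-joining approach, kept here only for comparison."""
    return "".join(ch for ch in title if ch.isdigit())


print(season_from_title("grey's anatomy s1"))  # 1
print(season_from_title("friends season 1"))   # 1
print(season_from_title("peaky blinders"))     # None
print(all_digits("24 s3"), season_from_title("24 s3"))  # 243 3
```

The helper can replace the digit-joining line inside the loop; rows where it returns None still need a fallback value.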
CodePudding user response:
I think the easiest way to do this is with the apply() method, together with a regex.
I first tried this, using a piece of your code:
data['season_new'] = data.apply(lambda x: x.season if pd.notna(x.season) else re.search(r'(season\s?\d+|s\s?\d+)',x.programme).group(1), axis=1)
The output was this:
programme season season_new
0 grey's anatomy s1 NaN s1
1 friends season 1 1.0 1.0
2 grey's anatomy s2 NaN s2
3 big bang theory s2 2.0 2.0
4 big bang theory 1.0 1.0
5 peaky blinders 1.0 1.0
As we can see, the season_new column is not 100% correct, so I tried another way:
data['season_new'] = data.apply(lambda x: x.season if pd.notna(x.season) else (x.programme[-1] if x.programme[-1].isdigit() else np.nan), axis=1).astype('int')
This gives the expected output:
programme season season_new
0 grey's anatomy s1 NaN 1
1 friends season 1 1.0 1
2 grey's anatomy s2 NaN 2
3 big bang theory s2 2.0 2
4 big bang theory 1.0 1
5 peaky blinders 1.0 1
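Taking only the last character works for the sample data, but it would read "s10" as 0, and .astype('int') raises an error if any row ends up as NaN. Below is a minimal sketch of the same apply() idea, assuming the season number always sits at the end of the title. The function name `pick_season` and the rows "some show s10" and the season-less "peaky blinders" are hypothetical additions for illustration:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample rows, including a two-digit season and a row with no season info
data = pd.DataFrame({
    "programme": ["grey's anatomy s1", "friends season 1", "some show s10", "peaky blinders"],
    "season": [np.nan, 1, np.nan, np.nan],
})


def pick_season(row):
    """Keep an existing season, otherwise use the trailing digits of the title."""
    if pd.notna(row["season"]):
        return int(row["season"])
    match = re.search(r"(\d+)$", row["programme"])
    return int(match.group(1)) if match else pd.NA


# The nullable "Int64" dtype keeps whole numbers and still allows missing values.
data["season_new"] = data.apply(pick_season, axis=1).astype("Int64")
print(data)
```

Rows with no season in either column stay missing instead of breaking the integer conversion.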
CodePudding user response:
You can use pandas.Series.fillna, since it accepts a Series as its value:
value: scalar, dict, Series, or DataFrame
Try this:
dt['season_new'] = (
    dt['programme']
    # (?:season|s) matches the word "season" or the letter "s"; (\d+) captures the number
    .str.extract(r'(?:season|s)\s?(\d+)', expand=False)
    .fillna(dt['season'])
    .astype(int)
)
If you want to remove the old season column, use pandas.DataFrame.pop:
dt['season_new'] = (
    dt['programme']
    # (?:season|s) matches the word "season" or the letter "s"; (\d+) captures the number
    .str.extract(r'(?:season|s)\s?(\d+)', expand=False)
    .fillna(dt.pop('season'))
    .astype(int)
)
# Output :
print(dt)
programme season_new
0 grey's anatomy s1 1
1 friends season 1 1
2 grey's anatomy s2 2
3 big bang theory s2 2
4 big bang theory 1
5 peaky blinders 1
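A regex detail worth keeping in mind with patterns like these: square brackets define a character class (any single one of the listed characters), whereas alternation between whole words needs a group such as (?:season|s). The sketch below is only illustrative, and the title "area 51" is a made-up example of a name that happens to end in a number:

```python
import re

# Character class: exactly one character out of s, e, a, o, n or "|"
char_class = re.compile(r"[season|s]\s?(\d+)")
# Non-capturing group: the word "season" or the single letter "s"
word_group = re.compile(r"(?:season|s)\s?(\d+)")

for title in ["grey's anatomy s1", "friends season 1", "area 51"]:
    in_class = char_class.search(title)
    in_group = word_group.search(title)
    print(
        title,
        in_class.group(1) if in_class else None,
        in_group.group(1) if in_group else None,
    )
```

Both patterns agree on the first two titles, but only the character class picks up "51" from "area 51", because the trailing "a" is one of the characters in the brackets.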
CodePudding user response:
Use the following code:
pat = r'(?:season|s)\s?(\d+)$'
df.assign(season_new=df['season'].fillna(df['programme'].str.extract(pat)[0]))
Result:
programme season season_new
grey's anatomy s1 NaN 1
friends season 1 1 1
grey's anatomy s2 NaN 2
big bang theory s2 2 2
big bang theory 1 1
peaky blinders 1 1
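Note that str.extract returns strings, so filling a numeric season column with the extracted values can leave a mix of numbers and text. Here is a minimal end-to-end sketch on the sample data from the question, assuming the season number ends the title; the variable names `extracted` and `out` are just for illustration. It converts the extracted text with pd.to_numeric before filling:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with null seasons as NaN
df = pd.DataFrame({
    "programme": [
        "grey's anatomy s1", "friends season 1", "grey's anatomy s2",
        "big bang theory s2", "big bang theory", "peaky blinders",
    ],
    "season": [np.nan, 1, np.nan, 2, 1, 1],
})

# "season" or "s", an optional space, then the digits that end the title
pat = r"(?:season|s)\s?(\d+)$"

# Extract the digits as text, then convert them to numbers (NaN where nothing matched)
extracted = pd.to_numeric(df["programme"].str.extract(pat, expand=False))

# Keep the existing season and fall back to the extracted number
out = df.assign(season_new=df["season"].fillna(extracted).astype("Int64"))
print(out)
```

Using the nullable "Int64" dtype means the column stays integer-typed even if some rows have no season in either place.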