How to assign a "null" value from another column?

I would like to create a new column called "season_new" that keeps the season where it is non-null and, where it is null, extracts the season from the programme name. My dataframe looks like this:

programme	season
grey's anatomy s1	null
friends season 1	1
grey's anatomy s2	null
big bang theory s2	2
big bang theory	1
peaky blinders	1

I tried using a regex:

dt['season_new'] = dt['programme'].str.extract(r'(season\s?\d+|s\s?\d+)')

But it gave me this result:

programme	season	season_new
grey's anatomy s1	null	1
friends season 1	1	1
grey's anatomy s2	null	2
big bang theory s2	2	2
big bang theory	1	null
peaky blinders	1	null

The result that I expected is:

programme	season	season_new
grey's anatomy s1	null	1
friends season 1	1	1
grey's anatomy s2	null	2
big bang theory s2	2	2
big bang theory	1	1
peaky blinders	1	1

CodePudding user response：

When trying your code, for some reason the regex didn't return only the integers:

0   grey's anatomy s1     NaN         s1
1    friends season 1     1.0   season 1
2   grey's anatomy s2     NaN         s2
3  big bang theory s2     2.0         s2
4     big bang theory     1.0        NaN
5      peaky blinders     1.0        NaN
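A likely reason for the prefix is that the capture group in that pattern wraps the whole alternative, so `str.extract` returns the prefix together with the digits. The sketch below is only an illustration on a three-row sample Series (the names `with_prefix` and `digits_only` are made up for this example); it compares that pattern with one that captures the digits alone:

```python
import pandas as pd

# Small sample of programme names (illustrative data only)
programmes = pd.Series(
    ["grey's anatomy s1", "friends season 1", "big bang theory"]
)

# The capture group wraps the whole match, so the prefix is returned too
with_prefix = programmes.str.extract(r"(season\s?\d+|s\s?\d+)", expand=False)

# Non-capturing group for the prefix; only the digits are captured
digits_only = programmes.str.extract(r"(?:season|s)\s?(\d+)", expand=False)

print(with_prefix.tolist())
print(digits_only.tolist())
```

Rows without a season marker stay missing in both cases, so they still need to be filled from the existing season column.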

I'm not great at regex, so I looked into another option, shown below.

import pandas as pd

# source_file is the path to the Excel file holding the data
df = pd.read_excel(source_file)

# Empty list for data capture
season_data = []

# Loop through all rows
for idx in df.index:

    # Grab value to check
    check_val = df.loc[idx, "season"]

    # If value is not null then keep it
    if pd.notnull(check_val):

        # Add value to list
        season_data.append(int(check_val))

    else:
        # Extract digits from programme description
        extract_result = "".join(i for i in df.loc[idx, "programme"] if i.isdigit())

        # Add value to list as an integer (None when the name has no digits)
        season_data.append(int(extract_result) if extract_result else None)

# Add full list to dataframe
df["season_new"] = season_data

print(df)

Result is:

            programme  season season_new
0   grey's anatomy s1     NaN          1
1    friends season 1     1.0          1
2   grey's anatomy s2     NaN          2
3  big bang theory s2     2.0          2
4     big bang theory     1.0          1
5      peaky blinders     1.0          1

CodePudding user response：

I think the easiest way to do this is with the apply() method, combined with a regex.

I first tried this, using a piece of your code:

data['season_new'] = data.apply(lambda x: x.season if pd.notna(x.season) else re.search(r'(season\s?\d+|s\s?\d+)',x.programme).group(1), axis=1)

The output was this:

             programme   season  season_new
0    grey's anatomy s1      NaN          s1
1     friends season 1      1.0         1.0
2    grey's anatomy s2      NaN          s2
3   big bang theory s2      2.0         2.0
4      big bang theory      1.0         1.0
5       peaky blinders      1.0         1.0

As we can see, the season_new column is not 100% correct, so I tried another way:

data['season_new'] = data.apply(lambda x: x.season if pd.notna(x.season) else (x.programme[-1] if x.programme[-1].isdigit() else np.nan), axis=1).astype('int')

The expected output:

             programme   season  season_new
0    grey's anatomy s1      NaN           1
1     friends season 1      1.0           1
2    grey's anatomy s2      NaN           2
3   big bang theory s2      2.0           2
4      big bang theory      1.0           1
5       peaky blinders      1.0           1
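Taking only the last character works for single-digit seasons. As a hedged variation on the same apply() idea, the sketch below uses a named helper (`resolve_season` and `SEASON_RE` are hypothetical names, not part of the answer above) that also copes with multi-digit seasons and with names that carry no season at all:

```python
import re

import numpy as np
import pandas as pd

# Season marker at the end of the name: "s12", "season 1", "season1", ...
SEASON_RE = re.compile(r"(?:season|s)\s?(\d+)$")


def resolve_season(row):
    """Return the existing season, or parse it from the programme name."""
    if pd.notna(row["season"]):
        return int(row["season"])
    match = SEASON_RE.search(row["programme"])
    # pd.NA marks rows where neither source provides a season
    return int(match.group(1)) if match else pd.NA


# Illustrative data, including a two-digit season and an unresolvable row
data = pd.DataFrame({
    "programme": ["grey's anatomy s12", "friends season 1", "peaky blinders"],
    "season": [np.nan, 1, np.nan],
})

# The nullable Int64 dtype keeps integers while still allowing missing values
data["season_new"] = data.apply(resolve_season, axis=1).astype("Int64")
print(data)
```

The nullable `Int64` dtype is a design choice here: a plain `astype('int')` raises an error as soon as one row cannot be resolved.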

CodePudding user response：

You can use pandas.Series.fillna since this one accepts Series as a value.

value: scalar, dict, Series, or DataFrame

Try this:

dt['season_new'] = (
    dt['programme']
        .str.extract(r'(?:season|s)\s?(\d+)', expand=False)
        .astype(float)
        .fillna(dt['season'])
        .astype(int)
)

If you want to remove the old season column, use pandas.Series.pop:

dt['season_new'] = (
    dt['programme']
        .str.extract(r'(?:season|s)\s?(\d+)', expand=False)
        .astype(float)
        .fillna(dt.pop('season'))
        .astype(int)
)

# Output :

print(dt)

            programme  season_new
0   grey's anatomy s1           1
1    friends season 1           1
2   grey's anatomy s2           2
3  big bang theory s2           2
4     big bang theory           1
5      peaky blinders           1

CodePudding user response：

Use the following code:

pat = r'(?:season|s)\s?(\d+)$'
df.assign(season_new=df['season'].fillna(df['programme'].str.extract(pat)[0]))

Result:

programme          season   season_new
grey's anatomy s1   NaN     1
friends season 1    1       1
grey's anatomy s2   NaN     2
big bang theory s2  2       2
big bang theory     1       1
peaky blinders      1       1
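For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand and makes two assumptions that are not in the answer above: the extracted strings are converted with `pd.to_numeric` before filling, and the result is stored with the nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "programme": [
        "grey's anatomy s1", "friends season 1", "grey's anatomy s2",
        "big bang theory s2", "big bang theory", "peaky blinders",
    ],
    "season": [np.nan, 1, np.nan, 2, 1, 1],
})

# Season marker anchored at the end of the programme name
pat = r"(?:season|s)\s?(\d+)$"

# Extracted digits are strings; convert them so they match the numeric column
parsed = pd.to_numeric(df["programme"].str.extract(pat, expand=False))

# Keep the existing season and fall back to the parsed value only where it is null
result = df.assign(season_new=df["season"].fillna(parsed).astype("Int64"))
print(result)
```

Filling in this direction keeps the existing season whenever it is present, which is what the question asks for.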