How do I split a column into two in Python based on the data in it?

For instance, the column I want to split here is duration. It holds data points such as 110 or 2 seasons. I want to make a separate column for the seasons values, and in place of those values my current column should say null, so that the column's type can change from string to int. (screenshot of my data)

I tried the split function, but that is for splitting within a single data point, not for separating different kinds of data points into their own columns.

CodePudding user response：

I have tried to replicate a portion of your dataframe in order to provide the solution below. Note that it will also change the NaN values to 'Null' as requested.

Creating the sample dataframe off of your screenshot:

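import numpy as np
import pandas as pd

# Sample data: movie lengths are ints, series lengths are strings, one value is missing.
movies_dic = {'release_year': [2021, 2020, 2021, 2021, 2021, 1940, 2018, 2008, 2021],
              'duration': [np.nan, 94, 108, 97, 104, 60, '4 Seasons', 90, '1 Season']}
stack_df = pd.DataFrame(movies_dic)
stack_df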

The issue is likely that the 'duration' column has the object dtype, that is, it holds both string and integer values. I wrote two small functions that use each value's type to assign it to the right column. The first takes all the string rows and places them in the 'series_duration' column:

def series(x):
    if type(x) == str:
        return x
    else:
        return 'Null'

Then the movies function keeps the integer values (i.e. those without the word 'Season' in them) as is:

def movies(x):
    if type(x) == int:
        return x
    else:
        return 'Null'

# Build the new column first, while 'duration' still holds the original strings.
stack_df['series_duration'] = stack_df['duration'].apply(series)

stack_df['duration'] = stack_df['duration'].apply(movies)
stack_df

   release_year duration series_duration
0          2021     Null            Null
1          2020       94            Null
2          2021      108            Null
3          2021       97            Null
4          2021      104            Null
5          1940       60            Null
6          2018     Null       4 Seasons
7          2008       90            Null
8          2021     Null        1 Season
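
Since the goal in the question is a numeric duration column, here is a minimal sketch of a variant that keeps real missing values instead of the string 'Null'. It is an alternative to the two functions above, not the same method: it assumes the sample dataframe from this answer, uses pandas' nullable 'Int64' dtype, and adds a hypothetical 'season_count' column that is not part of the original answer.

```python
import numpy as np
import pandas as pd

stack_df = pd.DataFrame({
    'release_year': [2021, 2020, 2021, 2021, 2021, 1940, 2018, 2008, 2021],
    'duration': [np.nan, 94, 108, 97, 104, 60, '4 Seasons', 90, '1 Season'],
})

# Rows holding a string are series lengths; everything else is a movie length.
is_season = stack_df['duration'].apply(lambda v: isinstance(v, str))

# Keep the season text in its own column; other rows become missing.
stack_df['series_duration'] = stack_df['duration'].where(is_season)

# Hypothetical extra column: the number of seasons as a nullable integer.
digits = stack_df['series_duration'].str.extract(r'(\d+)', expand=False)
stack_df['season_count'] = pd.to_numeric(digits).astype('Int64')

# Strings such as '4 Seasons' are coerced to missing, so the column can be an integer dtype.
stack_df['duration'] = pd.to_numeric(stack_df['duration'], errors='coerce').astype('Int64')
```

With this approach 'duration' ends up with an integer dtype rather than object, at the cost of showing <NA> instead of the literal text 'Null'.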

CodePudding user response：

I have created an example to give you some ideas about how to manage the problem.

First, I created a DataFrame containing ints, strings of the form 'X seasons', and negative numbers:

import numpy as np  # needed for np.nan in the loop below
import pandas as pd

data = [5, 4, 3, 4, 5, 6, '4 seasons', -110, 10]
df = pd.DataFrame(data, columns=['Numbers'])

Then I wrote the following loop. Depending on the format of each value (string or negative number), it creates a new column, inserts the value there, and turns the original value into NaN.

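index = 0
for n in df['Numbers']:
    if type(n) == str:
        # Copy the string to 'Seasons'; replace() blanks every occurrence of it in 'Numbers'.
        df.loc[index, 'Seasons'] = n
        df['Numbers'] = df['Numbers'].replace([n], np.nan)

    elif n < 0:
        # Copy the negative number to 'Negatives' and blank it out in 'Numbers'.
        df.loc[index, 'Negatives'] = n
        df['Numbers'] = df['Numbers'].replace([n], np.nan)

    index += 1  # advance to the next row label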

The output would be like this:

   Numbers    Seasons  Negatives
0      5.0        NaN        NaN
1      4.0        NaN        NaN
2      3.0        NaN        NaN
3      4.0        NaN        NaN
4      5.0        NaN        NaN
5      6.0        NaN        NaN
6      NaN  4 seasons        NaN
7      NaN        NaN     -110.0
8     10.0        NaN        NaN

Then you can adjust this example as you want.
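
As one possible adjustment, the same result can be sketched with boolean masks instead of a row-by-row loop. This is a swapped-in technique, not the loop from this answer: it assumes the same 'Numbers' data and relies on pd.to_numeric with errors='coerce' to turn the strings into NaN. The mask names is_text and is_negative are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Numbers': [5, 4, 3, 4, 5, 6, '4 seasons', -110, 10]})

# True where the cell holds a string such as '4 seasons'.
is_text = df['Numbers'].apply(lambda v: isinstance(v, str))

# Numeric view of the column: strings become NaN, numbers become floats.
numeric = pd.to_numeric(df['Numbers'], errors='coerce')
is_negative = numeric < 0  # NaN compares as False

df['Seasons'] = df['Numbers'].where(is_text)      # keep only the strings
df['Negatives'] = numeric.where(is_negative)      # keep only the negative numbers
df['Numbers'] = numeric.mask(is_negative)         # blank out strings and negatives
```

Because each mask is tied to a row position, only the matching rows are blanked, whereas replace() in the loop blanks every cell that equals the matched value.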