for instance the column i want to split is duration here, it has data points like - 110 or 2 seasons, i want to make a differerent column for seasons and in place of seasons in my current column it should say null as this would make the type of column int from string screenshot of my data
i tried the split function but that's for splliting in between data points, unlike splitting different other data points
CodePudding user response:
I have tried to replicate a portion of your dataframe in order to provide the below solution - note that it will also change the np.NaN values to 'Null' as requested.
Creating the sample dataframe off of your screenshot:
movies_dic = {'release_year': [2021,2020,2021,2021,2021,1940,2018,2008,2021],
'duration':[np.NaN, 94, 108, 97, 104, 60, '4 Seasons', 90, '1 Season']}
stack_df = pd.DataFrame(movies_dic)
stack_df
The issue is likely that the 'duration' column is of object dtypes - namely it contains both string and integer values in it. I have made 2 small functions that will make use of the data types and allocate them to their respective column. The first is taking all the 'string' rows and placing them in the 'series_duration' column:
def series(x):
if type(x) == str:
return x
else:
return 'Null'
Then the movies function keeps the integer values (i.e. those without the word 'Season' in them) as is:
def movies(x):
if type(x) == int:
return x
else:
return 'Null'
stack_df['series_duration'] = stack_df['duration'].apply(lambda x: series(x))
stack_df['duration'] = stack_df['duration'].apply(lambda x: movies(x))
stack_df
release_year duration series_duration
0 2021 Null Null
1 2020 94 Null
2 2021 108 Null
3 2021 97 Null
4 2021 104 Null
5 1940 60 Null
6 2018 Null 4 Seasons
7 2008 90 Null
8 2021 Null 1 Season
CodePudding user response:
I have created an example to give you some ideas about how to manage the problem.
First of all, I created a DF with ints, strings with format:' X seasons' and negative numbers:
import pandas as pd
data = [5,4,3,4,5,6,'4 seasons', -110, 10]
df = pd.DataFrame(data, columns=['Numbers'])
Then I created the next loop, what it does is to create new columns depending the format of the value (string or negative number), insert them and transform the original value into an NaN.
index=0
for n in df['Numbers']:
if type(n)==str:
df.loc[index, 'Seasons'] = n
df['Numbers'] = df['Numbers'].replace([n], np.nan)
elif n < 0:
df.loc[index, 'Negatives'] = n
df['Numbers'] = df['Numbers'].replace([n], np.nan)
index =1
The output would be like this:
Numbers Seasons Negatives
0 5.0 NaN NaN
1 4.0 NaN NaN
2 3.0 NaN NaN
3 4.0 NaN NaN
4 5.0 NaN NaN
5 6.0 NaN NaN
6 NaN 4 seasons NaN
7 NaN NaN -110.0
8 10.0 NaN NaN
Then you can adjust this example as you want.