I have a df
technologies= {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
Im trying to replace column values with dict values
dict = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2=df.replace({"Courses": dict})
print(df2)
but the rows with seperator , is not getting replaced even though there is values present Getting this as output
Courses Fee Duration Discount
0 Spark,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
but the output should be
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
CodePudding user response:
It's probably worth learning about how the regex parameter works so that you can leverage it in the future. None the less it is possible to split on the ,
and explode so that you have one word per row. Then you can replace and groupby the original index and join back to a comma separated string.
import pandas as pd
technologies= {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
d = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = (df.Courses.str.split(',').explode().replace(d)
.groupby(level=0).agg(','.join))
Output
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
CodePudding user response:
Method 1: Ensure all the compound words are before the single words. in the dictionary PySpark
is before Spark
d = {"PySpark" : 'P', "Spark" : 'S', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2 = df.replace({"Courses": d}, regex = True)
print(df2)
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
Method 2: Put the words in Boundary:
new_dict = pd.DataFrame(d.items(), columns = ['keys', 'values'])
new_dict['keys'] = '\\b' new_dict['keys'] '\\b'
new_dict = new_dict.set_index('keys').to_dict()['values']
df3 = df.replace({"Courses": new_dict}, regex = True)
df3
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
CodePudding user response:
Here's a way to do it that focuses on the column you want to change (Courses
):
dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
lambda x: x.str.split(',')).transform(
lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
Explanation:
- use
transform
to replace each csv string value in the column with a list - use
transform
again, this time to replace each item in a value's list using the dictionarydct
- use
Series.str.join
to convert each value's list back to a csv string.
Full test code:
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
lambda x: x.str.split(',')).transform(
lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
print(df)
Input:
Courses Fee Duration Discount
0 Spark,ABCD 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
4 Pandas 26000 NaN 2500
Output:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500