df replace is not working with seperator in pandas column-CodePudding

I have a df

technologies= {
    'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

Im trying to replace column values with dict values

dict = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2=df.replace({"Courses": dict})
print(df2)

but the rows with seperator , is not getting replaced even though there is values present Getting this as output

      Courses    Fee Duration  Discount
0  Spark,ABCD  22000   30days      1000
1           P  25000   50days      2300
2           H  23000   30days      1000
3           P  24000     None      1200
4           P  26000      NaN      2500

but the output should be

      Courses    Fee Duration  Discount
0      S,ABCD 22000   30days      1000
1           P  25000   50days      2300
2           H  23000   30days      1000
3           P  24000     None      1200
4           P  26000      NaN      2500

CodePudding user response：

It's probably worth learning about how the regex parameter works so that you can leverage it in the future. None the less it is possible to split on the , and explode so that you have one word per row. Then you can replace and groupby the original index and join back to a comma separated string.

import pandas as pd
technologies= {
    'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)

d = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}

df.Courses = (df.Courses.str.split(',').explode().replace(d)
                        .groupby(level=0).agg(','.join))

Output

  Courses    Fee Duration  Discount
0  S,ABCD  22000   30days      1000
1       P  25000   50days      2300
2       H  23000   30days      1000
3       P  24000     None      1200
4       P  26000      NaN      2500

CodePudding user response：

Method 1: Ensure all the compound words are before the single words. in the dictionary PySpark is before Spark

d = {"PySpark" : 'P', "Spark" : 'S', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2 = df.replace({"Courses": d}, regex  = True)
print(df2)

  Courses    Fee Duration  Discount
0  S,ABCD  22000   30days      1000
1       P  25000   50days      2300
2       H  23000   30days      1000
3       P  24000     None      1200
4       P  26000      NaN      2500

Method 2: Put the words in Boundary:

new_dict = pd.DataFrame(d.items(), columns = ['keys', 'values'])
new_dict['keys'] = '\\b'   new_dict['keys']   '\\b'
new_dict = new_dict.set_index('keys').to_dict()['values']
df3 = df.replace({"Courses": new_dict}, regex  = True)
df3

 Courses    Fee Duration  Discount
    0  S,ABCD  22000   30days      1000
    1       P  25000   50days      2300
    2       H  23000   30days      1000
    3       P  24000     None      1200
    4       P  26000      NaN      2500

CodePudding user response：

Here's a way to do it that focuses on the column you want to change (Courses):

dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
    lambda x: x.str.split(',')).transform(
    lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')

Explanation:

use transform to replace each csv string value in the column with a list
use transform again, this time to replace each item in a value's list using the dictionary dct
use Series.str.join to convert each value's list back to a csv string.

Full test code:

import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
    lambda x: x.str.split(',')).transform(
    lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
print(df)

Input:

      Courses    Fee Duration  Discount
0  Spark,ABCD  22000   30days      1000
1     PySpark  25000   50days      2300
2      Hadoop  23000   30days      1000
3      Python  24000     None      1200
4      Pandas  26000      NaN      2500

Output:

  Courses    Fee Duration  Discount
0  S,ABCD  22000   30days      1000
1       P  25000   50days      2300
2       H  23000   30days      1000
3       P  24000     None      1200
4       P  26000      NaN      2500