Home > Mobile >  df replace is not working with seperator in pandas column
df replace is not working with seperator in pandas column

Time:06-01

I have a df

technologies= {
    'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

Im trying to replace column values with dict values

dict = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2=df.replace({"Courses": dict})
print(df2)

but the rows with seperator , is not getting replaced even though there is values present Getting this as output

      Courses    Fee Duration  Discount
0  Spark,ABCD  22000   30days      1000
1           P  25000   50days      2300
2           H  23000   30days      1000
3           P  24000     None      1200
4           P  26000      NaN      2500

but the output should be

      Courses    Fee Duration  Discount
0      S,ABCD 22000   30days      1000
1           P  25000   50days      2300
2           H  23000   30days      1000
3           P  24000     None      1200
4           P  26000      NaN      2500

CodePudding user response:

It's probably worth learning about how the regex parameter works so that you can leverage it in the future. None the less it is possible to split on the , and explode so that you have one word per row. Then you can replace and groupby the original index and join back to a comma separated string.

import pandas as pd
technologies= {
    'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)

d = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}

df.Courses = (df.Courses.str.split(',').explode().replace(d)
                        .groupby(level=0).agg(','.join))

Output

  Courses    Fee Duration  Discount
0  S,ABCD  22000   30days      1000
1       P  25000   50days      2300
2       H  23000   30days      1000
3       P  24000     None      1200
4       P  26000      NaN      2500

CodePudding user response:

Method 1: Ensure all the compound words are before the single words. in the dictionary PySpark is before Spark

d = {"PySpark" : 'P', "Spark" : 'S', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2 = df.replace({"Courses": d}, regex  = True)
print(df2)

  Courses    Fee Duration  Discount
0  S,ABCD  22000   30days      1000
1       P  25000   50days      2300
2       H  23000   30days      1000
3       P  24000     None      1200
4       P  26000      NaN      2500

Method 2: Put the words in Boundary:

new_dict = pd.DataFrame(d.items(), columns = ['keys', 'values'])
new_dict['keys'] = '\\b'   new_dict['keys']   '\\b'
new_dict = new_dict.set_index('keys').to_dict()['values']
df3 = df.replace({"Courses": new_dict}, regex  = True)
df3

 Courses    Fee Duration  Discount
    0  S,ABCD  22000   30days      1000
    1       P  25000   50days      2300
    2       H  23000   30days      1000
    3       P  24000     None      1200
    4       P  26000      NaN      2500

CodePudding user response:

Here's a way to do it that focuses on the column you want to change (Courses):

dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
    lambda x: x.str.split(',')).transform(
    lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')

Explanation:

  • use transform to replace each csv string value in the column with a list
  • use transform again, this time to replace each item in a value's list using the dictionary dct
  • use Series.str.join to convert each value's list back to a csv string.

Full test code:

import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
    lambda x: x.str.split(',')).transform(
    lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
print(df)

Input:

      Courses    Fee Duration  Discount
0  Spark,ABCD  22000   30days      1000
1     PySpark  25000   50days      2300
2      Hadoop  23000   30days      1000
3      Python  24000     None      1200
4      Pandas  26000      NaN      2500

Output:

  Courses    Fee Duration  Discount
0  S,ABCD  22000   30days      1000
1       P  25000   50days      2300
2       H  23000   30days      1000
3       P  24000     None      1200
4       P  26000      NaN      2500
  • Related