How to "translate" a string into an integer in Python?

I have the following df:

print(df)
>>>
Marital Status       Income     Education 
  Married             66613       PhD  
  Married             12441       Bachelors 
  Single              52842       Masters Degree
  Relationship        78238       PhD
  Divorced            21242       High School
  Single              47183       Masters Degree

I'd like to convert every "String" to a corresponding number (int). E.g.

"Married" should be 1

"Single" 2

"Relationship" 3

and so on.

I still haven't tried any code yet since I haven't found any reasonable solution after googling for around 1 hour now, but I am sure that the solution is most likely incredibly simple.


CodePudding user response：

This may help you to get what you need.

df['Marital Status'] = df['Marital Status'].astype('category').cat.codes

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html
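Note that the codes produced this way start at 0 and follow the sorted order of the labels, not the 1, 2, 3 order asked for in the question. A minimal sketch that rebuilds only the "Marital Status" column from the question's sample data (the variable names are just for illustration):

```python
import pandas as pd

# Only the column of interest, copied from the question's sample data.
df = pd.DataFrame({
    "Marital Status": ["Married", "Married", "Single",
                       "Relationship", "Divorced", "Single"]
})

cat = df["Marital Status"].astype("category")

# Categories are inferred from the data and sorted alphabetically.
categories = list(cat.cat.categories)
# Codes are the 0-based positions of each value in that sorted list.
codes = cat.cat.codes.tolist()

print(categories)  # ['Divorced', 'Married', 'Relationship', 'Single']
print(codes)       # [1, 1, 3, 2, 0, 3]
```

If the exact numbers matter, an explicit mapping or an explicit list of categories is probably the safer choice.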

CodePudding user response：

This is exactly what pd.factorize does:

df['Marital Code'] = pd.factorize(df['Marital Status'])[0] + 1
print(df)

# Output

  Marital Status  Income       Education  Marital Code
0        Married   66613             PhD             1
1        Married   12441       Bachelors             1
2         Single   52842  Masters Degree             2
3   Relationship   78238             PhD             3
4       Divorced   21242     High School             4
5         Single   47183  Masters Degree             2
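pd.factorize returns two things: the codes and the unique labels in order of first appearance. Keeping the second part makes it possible to translate the numbers back later. A small sketch, assuming the same sample values as in the question (the names marital_code and decode are made up for this example):

```python
import pandas as pd

status = pd.Series(["Married", "Married", "Single",
                    "Relationship", "Divorced", "Single"])

# factorize returns (codes, uniques); codes are 0-based,
# numbered in order of first appearance.
codes, uniques = pd.factorize(status)
marital_code = (codes + 1).tolist()

# Lookup table from number back to label.
decode = {i + 1: label for i, label in enumerate(uniques)}

print(marital_code)  # [1, 1, 2, 3, 4, 2]
print(decode)        # {1: 'Married', 2: 'Single', 3: 'Relationship', 4: 'Divorced'}
```

By default, missing values get the code -1, so they would show up as 0 after adding 1.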

CodePudding user response：

Another solution, using .map:

df["Marital Status"] = df["Marital Status"].map(
    {"Married": 1, "Single": 2, "Relationship": 3, "Divorced": 4}
)

print(df)

Prints:

   Marital Status  Income       Education
0               1   66613             PhD
1               1   12441       Bachelors
2               2   52842  Masters Degree
3               3   78238             PhD
4               4   21242     High School
5               2   47183  Masters Degree

CodePudding user response：

Another .map solution, this time with a lookup DataFrame, which is a little more readable and easier to extend if you want to add more values later:

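# Lookup table from label to integer; add rows here to support more labels.
df_map = pd.DataFrame({
    'Text': ['Married', 'Single', 'Relationship'],
    'Int_Conversion': [1, 2, 3]
})

# Labels missing from df_map (here 'Divorced') are mapped to NaN.
df['Marital Status'] = df['Marital Status'].map(df_map.set_index('Text')['Int_Conversion'])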
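One thing to watch with .map: any label that has no entry in the mapping comes back as NaN, which also turns the column into floats. A rough sketch of how to spot and handle that. Reserving 0 for unknown labels is only an assumption for this example, not something the question asks for:

```python
import pandas as pd

status = pd.Series(["Married", "Single", "Relationship", "Divorced"])
# 'Divorced' is left out on purpose to show what happens.
mapping = {"Married": 1, "Single": 2, "Relationship": 3}

mapped = status.map(mapping)

# Labels without an entry come back as NaN.
missing = status[mapped.isna()].unique().tolist()

# Assumption for this sketch: use 0 for unknown labels, then go back to int.
filled = mapped.fillna(0).astype(int).tolist()

print(missing)  # ['Divorced']
print(filled)   # [1, 2, 3, 0]
```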

CodePudding user response：

One approach using categories that works independently of the data:

categories = pd.CategoricalDtype(categories=["Married", "Single", "Relationship", "Divorced"], ordered=True)
df["result"] = df["Marital Status"].astype(categories).cat.codes + 1
print(df)

Output

  Marital Status  Income       Education  result
0        Married   66613             PhD       1
1        Married   12441       Bachelors       1
2         Single   52842  Masters Degree       2
3   Relationship   78238             PhD       3
4       Divorced   21242     High School       4
5         Single   47183  Masters Degree       2

This approach is suggested by the documentation for controlling the behavior. Quoting it:

In the examples above where we passed dtype='category', we used the default behavior:

Categories are inferred from the data.

Categories are unordered.

To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.
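A side effect of fixing the categories up front is that a value outside the list becomes missing and gets the code -1, so it ends up as 0 after adding 1. A short sketch, where "Widowed" is a hypothetical label that is not in the question's data:

```python
import pandas as pd

dtype = pd.CategoricalDtype(
    categories=["Married", "Single", "Relationship", "Divorced"], ordered=True
)

# "Widowed" is a hypothetical label that is not in the list of categories.
status = pd.Series(["Single", "Widowed", "Married"])

# Unknown labels become missing values with code -1, hence 0 after the + 1.
codes = (status.astype(dtype).cat.codes + 1).tolist()

print(codes)  # [2, 0, 1]
```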

CodePudding user response：

By "corresponding number", do you have a specific numbering scheme in mind, or will any number do as long as the same string always gets the same number?

If the latter, then this code should work.

def replace_words(text):
    next_number = 1
    word_map = {}

    def get_number(word):
        # word_map is only mutated, so just next_number needs nonlocal.
        nonlocal next_number
        if word in word_map:
            return word_map[word]
        # First time this word is seen: give it the next free number.
        word_map[word] = next_number
        next_number += 1
        return word_map[word]

    words = text.split(" ")
    replaced_words = [get_number(x) for x in words]
    return " ".join(str(x) for x in replaced_words)

print(replace_words("some words some thoughts"))  # 1 2 1 3
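The same idea can be applied to a list of labels instead of a space-separated string, which is closer to the column in the question. A plain-Python sketch (number_labels is a made-up helper name, and the input list is copied from the question's sample data):

```python
def number_labels(labels):
    """Give each distinct label the next free number, in order of first appearance."""
    label_map = {}
    numbers = []
    for label in labels:
        if label not in label_map:
            # New label: its number is the count of labels seen so far, plus one.
            label_map[label] = len(label_map) + 1
        numbers.append(label_map[label])
    return numbers, label_map


numbers, label_map = number_labels(
    ["Married", "Married", "Single", "Relationship", "Divorced", "Single"]
)

print(numbers)    # [1, 1, 2, 3, 4, 2]
print(label_map)  # {'Married': 1, 'Single': 2, 'Relationship': 3, 'Divorced': 4}
```

Returning the mapping along with the numbers keeps the translation reversible.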