I have the following df:
print(df)
Marital Status  Income       Education
       Married   66613             PhD
       Married   12441       Bachelors
        Single   52842  Masters Degree
  Relationship   78238             PhD
      Divorced   21242     High School
        Single   47183  Masters Degree
I'd like to convert every string to a corresponding number (int), e.g.
"Married" should be 1
"Single" 2
"Relationship" 3
and so on.
I haven't tried any code yet because I couldn't find a reasonable solution after about an hour of googling, though I'm sure the solution is very simple.
CodePudding user response:
This may help you to get what you need.
df['Marital Status'] = df['Marital Status'].astype('category').cat.codes
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html
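One thing to watch with this approach: when the categories are inferred from string data they are sorted, and the codes start at 0, so the numbers will not follow the 1, 2, 3 order from the question. A minimal sketch on a small made-up Series (the variable names are only for illustration):

```python
import pandas as pd

s = pd.Series(["Married", "Married", "Single", "Relationship", "Divorced", "Single"])
cat = s.astype("category")

# Inferred categories are sorted, and codes are positions in that sorted list
categories = list(cat.cat.categories)
codes = cat.cat.codes.tolist()
print(categories)  # ['Divorced', 'Married', 'Relationship', 'Single']
print(codes)       # [1, 1, 3, 2, 0, 3]

# Keep a code -> label lookup so the numbers can be translated back later
lookup = dict(enumerate(cat.cat.categories))
print(lookup[1])   # Married
```

If the exact numbers matter, an explicit mapping or an explicit category order is probably the safer choice.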
CodePudding user response:
This is exactly what pd.factorize does:
df['Marital Code'] = pd.factorize(df['Marital Status'])[0] + 1
print(df)
# Output
   Marital Status  Income       Education  Marital Code
0         Married   66613             PhD             1
1         Married   12441       Bachelors             1
2          Single   52842  Masters Degree             2
3    Relationship   78238             PhD             3
4        Divorced   21242     High School             4
5          Single   47183  Masters Degree             2
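For context, pd.factorize returns two things: the integer codes and the unique values they refer to. Codes are handed out in order of first appearance starting at 0, and missing values get -1 by default. A small sketch with made-up data that includes a missing entry:

```python
import pandas as pd

s = pd.Series(["Married", "Married", "Single", None, "Divorced", "Single"])

# codes: one integer per row; uniques: the labels, in order of first appearance
codes, uniques = pd.factorize(s)
print(codes.tolist())   # [0, 0, 1, -1, 2, 1]
print(list(uniques))    # ['Married', 'Single', 'Divorced']

# Shifting by 1 gives 1-based numbers; missing values then end up as 0
one_based = (codes + 1).tolist()
print(one_based)        # [1, 1, 2, 0, 3, 2]
```

Keeping uniques around makes it possible to translate a code back to its label, e.g. uniques[codes[0]].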
CodePudding user response:
Another solution, using .map with an explicit dictionary (each key should appear only once; a repeated key silently overrides the earlier value):
df["Marital Status"] = df["Marital Status"].map(
    {"Married": 1, "Single": 2, "Relationship": 3, "Divorced": 4}
)
print(df)
Prints:
   Marital Status  Income       Education
0               1   66613             PhD
1               1   12441       Bachelors
2               2   52842  Masters Degree
3               3   78238             PhD
4               4   21242     High School
5               2   47183  Masters Degree
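A caveat with a dictionary passed to .map: any value that is not a key in the dictionary comes back as NaN, which also turns the column into floats. A rough sketch of how one might check for that, using a made-up "Widowed" value and 0 as a purely hypothetical placeholder for unknown labels:

```python
import pandas as pd

mapping = {"Married": 1, "Single": 2, "Relationship": 3, "Divorced": 4}
s = pd.Series(["Married", "Widowed", "Single"])  # "Widowed" is not in the mapping

mapped = s.map(mapping)  # float64, with NaN for the unmapped value

# List the labels that the mapping does not cover
unmapped = s[mapped.isna()].unique().tolist()
print(unmapped)  # ['Widowed']

# Assumption: 0 is an acceptable stand-in for "unknown"; pick whatever fits your data
as_int = mapped.fillna(0).astype(int)
print(as_int.tolist())  # [1, 0, 2]
```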
CodePudding user response:
Another map-based solution, with a little more readability and control if you want to add more values later:
df_map = pd.DataFrame({
    'Text': ['Married', 'Single', 'Relationship', 'Divorced'],
    'Int_Conversion': [1, 2, 3, 4]
})
# Values missing from df_map become NaN after mapping
df['Marital Status'] = df['Marital Status'].map(df_map.set_index('Text')['Int_Conversion'])
CodePudding user response:
One approach using categories that works independently of the data:
categories = pd.CategoricalDtype(categories=["Married", "Single", "Relationship", "Divorced"], ordered=True)
df["result"] = df["Marital Status"].astype(categories).cat.codes + 1
print(df)
Output:
   Marital Status  Income       Education  result
0         Married   66613             PhD       1
1         Married   12441       Bachelors       1
2          Single   52842  Masters Degree       2
3    Relationship   78238             PhD       3
4        Divorced   21242     High School       4
5          Single   47183  Masters Degree       2
The documentation suggests this approach for controlling the behavior. Quote (emphasis mine):
In the examples above where we passed dtype='category', we used the default behavior:
Categories are inferred from the data.
Categories are unordered.
To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.
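To illustrate why the explicit CategoricalDtype matters, here is a small sketch comparing it with inferred categories on a subset of the data that contains only two of the four labels (the subset is made up for the example):

```python
import pandas as pd

dtype = pd.CategoricalDtype(
    categories=["Married", "Single", "Relationship", "Divorced"], ordered=True
)
subset = pd.Series(["Divorced", "Single"])

# Inferred categories depend on which labels happen to be present
inferred = (subset.astype("category").cat.codes + 1).tolist()
print(inferred)  # [1, 2]

# Explicit categories give the same number for a label no matter the data
fixed = (subset.astype(dtype).cat.codes + 1).tolist()
print(fixed)     # [4, 2]
```

With the fixed dtype, "Divorced" stays 4 and "Single" stays 2 even when other labels are absent.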
CodePudding user response:
By "corresponding number", do you have a specific numbering scheme in mind, or will any number do as long as the same string always gets the same number?
If the latter, then this code should work:
def replace_words(text):
    next_number = 1
    word_map = {}

    def get_number(word):
        # Reuse the number if the word was seen before, otherwise assign the next one
        nonlocal next_number, word_map
        if word in word_map:
            return word_map[word]
        word_map[word] = next_number
        next_number = next_number + 1
        return next_number - 1

    words = text.split(" ")
    replaced_words = [get_number(x) for x in words]
    return " ".join([str(x) for x in replaced_words])

print(replace_words("some words some thoughts"))  # prints: 1 2 1 3
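The function above works on a space-separated string. A possible adaptation of the same first-seen numbering to DataFrame columns is sketched below; number_by_first_appearance is a hypothetical helper name, and the two-column frame is just the sample data from the question without the Income column:

```python
import pandas as pd

def number_by_first_appearance(values):
    """Return 1-based numbers in order of first appearance, plus the mapping used."""
    word_map = {}
    result = []
    for value in values:
        if value not in word_map:
            word_map[value] = len(word_map) + 1
        result.append(word_map[value])
    return result, word_map

df = pd.DataFrame({
    "Marital Status": ["Married", "Married", "Single", "Relationship", "Divorced", "Single"],
    "Education": ["PhD", "Bachelors", "Masters Degree", "PhD", "High School", "Masters Degree"],
})

encoded = df.copy()
maps = {}  # one label -> number mapping per column
for col in ["Marital Status", "Education"]:
    encoded[col], maps[col] = number_by_first_appearance(df[col])

print(encoded["Marital Status"].tolist())  # [1, 1, 2, 3, 4, 2]
print(encoded["Education"].tolist())       # [1, 2, 3, 1, 4, 3]
```

Each column gets its own numbering, and maps keeps the dictionaries so the numbers can be decoded again.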