Pandas - Two questions - Creating columns and replacing values based on search result-CodePudding

#!/usr/bin/env python3

import pandas
import numpy

example_dataset = {
    'Date' : ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type' : ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description' : ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount' : ['156.00', '55', '21.50', '176.10']
}

df = pandas.DataFrame(example_dataset)

df ['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')
df['Category'] = 'tempvalue'

df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries', df['Category'])
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff', df['Category'])

print (df)

Given the above code I have two related questions please:

Is there a better way to create the Category column than with the temp value and later replacing it with specific values as shown? I ask as it feels messy.
How could I have the terms to search for and the category to assign in a separate file? Is that possible? I ask as I want to make it easy for myself to add new terms and define the categories in the future.

Thank you

CodePudding user response：

1. First Question

you don't need to pre-create the new column, you could do something like this:


#df['Category'] = 'tempvalue'

df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries',numpy.nan)
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff',df['Category'])

2. Second Question

let's create a simple key-value file called categories.json in the same directory of your script

{
    "Tesco|Sainsbury":"Groceries",
    "Dentist|Cinema":"Stuff"
}

you could do something like this to automate the categories assignement

import pandas
import numpy
import json

example_dataset = {
    'Date' : ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type' : ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description' : ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount' : ['156.00', '55', '21.50', '176.10']
}

df = pandas.DataFrame(example_dataset)

df ['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')



with open('categories.json') as file:
    categories_dict = json.load(file)

df['Category'] = numpy.nan
for key,value in categories_dict.items():
    df['Category'] = numpy.where(df['Description'].str.contains(key), value,df['Category'])

In this scenario I suggest to mantain the column initialization for simplicity