Home > Enterprise >  Pandas - Two questions - Creating columns and replacing values based on search result
Pandas - Two questions - Creating columns and replacing values based on search result

Time:04-29

#!/usr/bin/env python3

import pandas
import numpy

example_dataset = {
    'Date' : ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type' : ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description' : ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount' : ['156.00', '55', '21.50', '176.10']
}

df = pandas.DataFrame(example_dataset)

df ['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')
df['Category'] = 'tempvalue'

df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries', df['Category'])
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff', df['Category'])

print (df)

Given the above code I have two related questions please:

  1. Is there a better way to create the Category column than with the temp value and later replacing it with specific values as shown? I ask as it feels messy.
  2. How could I have the terms to search for and the category to assign in a separate file? Is that possible? I ask as I want to make it easy for myself to add new terms and define the categories in the future.

Thank you

CodePudding user response:

1. First Question

you don't need to pre-create the new column, you could do something like this:


#df['Category'] = 'tempvalue'

df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries',numpy.nan)
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff',df['Category'])

2. Second Question

let's create a simple key-value file called categories.json in the same directory of your script

{
    "Tesco|Sainsbury":"Groceries",
    "Dentist|Cinema":"Stuff"
}

you could do something like this to automate the categories assignement

import pandas
import numpy
import json

example_dataset = {
    'Date' : ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type' : ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description' : ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount' : ['156.00', '55', '21.50', '176.10']
}

df = pandas.DataFrame(example_dataset)

df ['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')



with open('categories.json') as file:
    categories_dict = json.load(file)

df['Category'] = numpy.nan
for key,value in categories_dict.items():
    df['Category'] = numpy.where(df['Description'].str.contains(key), value,df['Category'])

In this scenario I suggest to mantain the column initialization for simplicity

  • Related