Pandas look for substring then write in another


So I'm trying to search a specific column of my CSV file for a partial string. If a row meets a certain condition, it should write something else in a different column.

For example:

Letter Grade     Percentage
     A        Ninety Five Percent
     C        Seventy Three Percent
     B        Eighty Two Percent

The "Percentage" column will always have the same format of "Ninety Five Percent" that is number in words Percent.

I want to be able to open up a CSV file and have Python parse the "Percentage" column for the first word: if it contains "Ninety", assign "A" in the "Letter Grade" column; if it contains "Eighty", assign "B"; and so on.

This is my current code:

import pandas as pd

df = pd.DataFrame(pd.read_csv(r'file.csv', dtype=str))

A = 'Ninety'
B = 'Eighty'
C = 'Seventy'
D = 'Sixty'

if df[df['Percentage'].isin(A)]:
     df['Letter Grade'] = df['Letter Grade'].str.replace['', 'A', regex=False]
elif df[df['Percentage'].isin(B)]:
     df['Letter Grade'] = df['Letter Grade'].str.replace['', 'B', regex=False]
elif df[df['Percentage'].isin(C)]:
     df['Letter Grade'] = df['Letter Grade'].str.replace['', 'C', regex=False]
elif df[df['Percentage'].isin(D)]:
     df['Letter Grade'] = df['Letter Grade'].str.replace['', 'D', regex=False]
else:
     df['Letter Grade'] = df['Letter Grade'].str.replace['', 'F', regex=False]


df.to_csv(r'file.csv', index=False)

It is currently giving me this error: The truth value of a DataFrame is ambiguous.

CodePudding user response:

You can use a dictionary with split and map to do that:

# map the first word of each "Percentage" value to its letter grade
grades = {'Ninety': 'A', 'Eighty': 'B', 'Seventy': 'C', 'Sixty': 'D'}

df['Letter Grade'] = df.Percentage.str.split(expand=True)[0].map(grades)
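
For the sample rows in the question, this gives the following (a quick check with the data hard-coded instead of read from file.csv):

import pandas as pd

# sample data from the question, hard-coded here just for illustration
df = pd.DataFrame({"Percentage": ["Ninety Five Percent",
                                  "Seventy Three Percent",
                                  "Eighty Two Percent"]})

grades = {'Ninety': 'A', 'Eighty': 'B', 'Seventy': 'C', 'Sixty': 'D'}
df['Letter Grade'] = df.Percentage.str.split(expand=True)[0].map(grades)

print(df['Letter Grade'].tolist())   # ['A', 'C', 'B']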

CodePudding user response:

Try with numpy.select:

import numpy as np

df["Letter Grade"] = np.select([df["Percentage"].str.contains(A),
                                df["Percentage"].str.contains(B), 
                                df["Percentage"].str.contains(C),
                                df["Percentage"].str.contains(D)], 
                               ["A","B","C","D"])

Alternatively with findall and map:

mapper = {"Ninety": "A", "Eighty": "B", "Seventy": "C", "Sixty": "D"}
df["Letter Grade"] = df["Percentage"].str.findall("|".join(mapper.keys())).str[0].map(mapper)

CodePudding user response:

@not_speshal's answer is great, but it doesn't actually explain the precise issue that you're having.

isin checks for exact, whole-item matches rather than substrings (e.g., it would match a cell that is exactly 'Ninety' but not 'Ninety Five Percent'), and indexing with the result as df[...] gives you back a DataFrame. A DataFrame has no single truth value, which is why your if statement raises the "truth value is ambiguous" error.

Instead, you can use .str.contains(...) on the column, which returns a boolean Series where each entry is True if that row's string contains your substring and False otherwise. You can then call .any() on that Series, which returns True if at least one entry is True and False if no strings matched.

Instead of

if df[df['Percentage'].isin(A)]:

you want:

if df['Percentage'].str.contains('Ninety').any():
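
As a minimal sketch of the difference, using the sample data from the question: .str.contains gives one boolean per row, and .any() collapses that to a single bool that is safe to use in an if statement.

import pandas as pd

# sample data from the question, hard-coded here just for illustration
df = pd.DataFrame({"Percentage": ["Ninety Five Percent",
                                  "Seventy Three Percent",
                                  "Eighty Two Percent"]})

mask = df['Percentage'].str.contains('Ninety')
print(mask.tolist())   # [True, False, False] -- one boolean per row
print(mask.any())      # True -- a single bool, usable as an if condition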