Pandas dataframe: select list items in a column, then transform string on the items

Time:06-07

One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform the value and add it to one of two new columns in the dataframe. Before:

Name Listed_Items
Tom ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee","ot_other_stuff"]

From what I've read, I can turn the column into a list:

df["Listed_Items"] = df["Listed_Items"].apply(eval)

But then I cannot see how to find the list item that starts with dr_md, extract it, remove the leading dr_md_, replace any underscores with spaces, capitalize the first letter of each word and add the result to a new MD column in the row. Then the same again for dr_od. In each row, exactly one item in the list starts with dr_md and exactly one starts with dr_od. Desired output:

Name MD OD
Tom Coca Cola Water
Steve Pepsi Orange Juice
Phil Dr Pepper Coffee

CodePudding user response:

What you need to do is make a function that does the processing for you that you can pass into apply (or in this case, map). Alternatively, you could expand your list column into multiple columns and then process them afterwards, but that will only work if your lists are always in the same order (see panda expand columns with list into multiple columns). Because you only have one input column, you could use map instead of apply.

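def process_dr_md(l: list):
    # Return the first item that starts with "dr_md_", minus that prefix
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]

def process_dr_od(l: list):
    # Return the first item that starts with "dr_od_", minus that prefix
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]

# Column name matches the question's table (Listed_Items)
df["Listed_Items"] = df["Listed_Items"].map(eval)
df["MD"] = df["Listed_Items"].map(process_dr_md)
df["OD"] = df["Listed_Items"].map(process_dr_od)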
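
The "process your string further here" step could look like the sketch below. It assumes the column is named Listed_Items as in the question's table and that the lists arrive as strings. extract_item is a hypothetical helper name, and ast.literal_eval is swapped in for eval because it only parses Python literals instead of executing arbitrary code.

```python
import ast
import pandas as pd


def extract_item(items, prefix):
    """Return the first item starting with `prefix`, cleaned up, or None."""
    for s in items:
        if s.startswith(prefix):
            # Drop the prefix, turn underscores into spaces, title-case each word
            return s[len(prefix):].replace("_", " ").title()
    return None


# Sample data from the question, with the lists stored as strings (assumption)
df = pd.DataFrame({
    "Name": ["Tom", "Steve", "Phil"],
    "Listed_Items": [
        '["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]',
        '["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]',
        '["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee","ot_other_stuff"]',
    ],
})

# Parse each string into a real Python list
df["Listed_Items"] = df["Listed_Items"].map(ast.literal_eval)
df["MD"] = df["Listed_Items"].map(lambda items: extract_item(items, "dr_md_"))
df["OD"] = df["Listed_Items"].map(lambda items: extract_item(items, "dr_od_"))
result = df[["Name", "MD", "OD"]]
```

Note that str.title() capitalizes every word, which matches the desired output ("Coca Cola", "Dr Pepper"), whereas str.capitalize() would only capitalize the first word.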

I hope that gets you on your way!

CodePudding user response:

Use pivot_table

df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]

df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD', 
                                                           False: 'OD'})

df.pivot_table(values='Listed_Items', 
               columns='Type', 
               index='Name',
               aggfunc='first')

Type                MD                  OD
Name                                      
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water

From here it's just a matter of cleaning up the values as you wish (stripping the prefix, replacing underscores and capitalizing).
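
One possible way to do that cleanup is sketched below. It rebuilds the pivoted frame shown above by hand so the snippet runs on its own; the names wide and pretty are hypothetical, and the regular expression assumes every value starts with dr_md_ or dr_od_.

```python
import pandas as pd

# The pivoted frame from above, reconstructed by hand for a standalone example
wide = pd.DataFrame(
    {
        "MD": ["dr_md_dr_pepper", "dr_md_pepsi", "dr_md_coca_cola"],
        "OD": ["dr_od_coffee", "dr_od_orange_juice", "dr_od_water"],
    },
    index=pd.Index(["Phil", "Steve", "Tom"], name="Name"),
)

# For each column: strip the leading prefix, swap underscores for spaces,
# then title-case every word
pretty = wide.apply(
    lambda col: col.str.replace(r"^dr_(?:md|od)_", "", regex=True)
                   .str.replace("_", " ")
                   .str.title()
)
print(pretty)
```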

CodePudding user response:

I took a slightly different approach from the previous answers. Given a df of the form:

    Name    Items
0   Tom [dr_md_coca_cola, dr_od_water, potatoes, grass...
1   Steve   [dr_od_orange_juice, potatoes, grass, ot_other...
2   Phil    [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...  

and making the following assumptions:

  1. only one item in a list matches the target mask
  2. the target mask always appears at the start of the entry string

I created the following function to parse the list:

import re
def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        if p.match(itm):
            return itm[p.search(itm).span()[1]:].replace('_', ' ')  

Then you can modify your original dataframe with the following:

df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')  

This produces the following:

    Name    MD          OD
0   Tom     coca cola   water
1   Steve   pepsi       orange juice
2   Phil    dr pepper   coffee
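
The output above is lowercase, while the question asks for capitalized names. A small variant that handles this is sketched below, under the same two assumptions. parse_items_titled is a hypothetical name; re.escape is added so the mask is treated as literal text, and the function returns None when nothing matches.

```python
import re
from typing import List, Optional


def parse_items_titled(tgt_mask: str, itm_list: List[str]) -> Optional[str]:
    # Escape the mask so any regex metacharacters in it are matched literally
    p = re.compile(re.escape(tgt_mask))
    for itm in itm_list:
        m = p.match(itm)  # match() only succeeds at the start of the string
        if m:
            # Keep what follows the mask, then tidy and title-case it
            return itm[m.end():].replace("_", " ").title()
    return None


items = ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]
md = parse_items_titled("dr_md_", items)
od = parse_items_titled("dr_od_", items)
```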

CodePudding user response:

I would normalize the data before putting it in a dataframe:

import pandas as pd
from typing import Dict, List, Tuple


def clean_stuff(text: str):
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])


def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    md_od = sorted(md_od)
    print(md_od)

    return clean_stuff(md_od[0]), clean_stuff(md_od[1])


dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]

normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)