Extracting from Pandas Column with regex pattern

Time:09-09

I have a Pandas Dataframe with the following structure:

pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR', '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR', '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])
RB
None
1 RB, 2 TE, 2 WR
1 RB, 1 TE, 3 WR
1 RB, 3 TE, 1 WR
1 RB, 0 TE, 4 WR

Ideally, I would prefer to split the column into the following format:

RB TE WR P LB LS DB OL
0 0 0 0 0 0 0 0
1 2 2 0 0 0 0 0
1 1 3 0 0 0 0 0
1 3 1 0 0 0 0 0
1 0 4 0 0 0 0 0

Where each of the original column values is parsed based on the label ("1 RB" would be the value 1 in the column "RB"). The pattern will always be [# position].

How would I accomplish this? Each column value in the original dataframe column is one long string, so it isn't already an array or something. Additionally, not every value in the original dataframe column follows the same order; i.e. there isn't a common pattern in the order of RB, TE, WR-- if there isn't a value, the string does not include "0 WR" for example.

CodePudding user response:

try this:

def make_dict(g: pd.DataFrame):
    res = dict(g.values[:,[-1,0]])
    return res

grouped = df[0].str.extractall(r'(\d+)\s(\w+)').groupby(level=0)
tmp = grouped.apply(make_dict)
result = pd.DataFrame([*tmp], index=tmp.index).reindex(df.index)
print(result)

>>>

    RB  TE  WR  P   LB  LS  DB  OL
0   NaN NaN NaN NaN NaN NaN NaN NaN
1   1   2   2   NaN NaN NaN NaN NaN
2   1   1   3   NaN NaN NaN NaN NaN
3   1   3   1   NaN NaN NaN NaN NaN
4   1   0   4   NaN NaN NaN NaN NaN
5   2   1   2   NaN NaN NaN NaN NaN
6   2   2   1   NaN NaN NaN NaN NaN
7   1   2   1   1   2   1   3   NaN
8   2   2   0   NaN NaN NaN NaN 6
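
The table above still has NaN where a position is missing, and the counts are strings rather than numbers. Here is a minimal sketch of one way to get the zero-filled integer table the question asks for. It assumes the same data as in the question and swaps the `make_dict` step for `unstack`, so it is an alternative route rather than the method above. The group names `count` and `position` and the variables `counts` and `wanted` are names I made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR',
                   '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR',
                   '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])

# One row per match; the named groups become the column names.
matches = df[0].str.extractall(r'(?P<count>\d+)\s(?P<position>\w+)')

counts = (
    matches.astype({'count': int})
           .droplevel('match')                         # keep only the original row number
           .set_index('position', append=True)['count']
           .unstack(fill_value=0)                      # positions become columns
           .reindex(df.index, fill_value=0)            # bring back rows with no matches (None)
)

# unstack sorts the columns alphabetically; reorder to match the question.
wanted = ['RB', 'TE', 'WR', 'P', 'LB', 'LS', 'DB', 'OL']
counts = counts[wanted]
print(counts)
```

This assumes a position never appears twice in the same string; if it could, `unstack` would raise on the duplicate and the counts would need to be summed first.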

CodePudding user response:

Here is a step-by-step way to do it, assuming the pattern is always "# position, # position, ...":

import pandas as pd
df = pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR', '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR', '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])

# create a list of dictionaries
rows = []
for i, r in df.iterrows():
    data = r[0]
    try:
        # assuming the items are comma separated
        items = data.split(',')
    except AttributeError:
        # ignore data like None (these rows are left out of the result)
        continue

    row = {}
    for item in items:
        # pattern: # position
        value, key = item.strip().split()
        row[key] = value
    # append once per original row, after all of its items are parsed
    rows.append(row)
 

# convert list of dictionaries to dataframe
new_df = pd.DataFrame(rows)
print(new_df)
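
The loop above skips the None row, so the result no longer lines up with the original index, and the values stay as strings with NaN for missing positions. Here is a minimal sketch of a variant that keeps every row and fills the gaps with 0, as the question asks. The helper name `parse_positions` is hypothetical, and it assumes every item really is "# position" separated by commas.

```python
import pandas as pd

def parse_positions(text):
    """Turn '1 RB, 2 TE' into {'RB': 1, 'TE': 2}; anything that is not a string gives {}."""
    if not isinstance(text, str):
        return {}
    row = {}
    for item in text.split(','):
        value, key = item.strip().split()
        row[key] = int(value)
    return row

df = pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR',
                   '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR',
                   '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])

# One dict per original row (empty for None), so the index stays aligned with df.
new_df = pd.DataFrame([parse_positions(v) for v in df[0]], index=df.index)

# Missing positions come out as NaN; replace them with 0 and go back to integers.
new_df = new_df.fillna(0).astype(int)
print(new_df)
```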