pandas best way to process string column (not split)-CodePudding

INTRODUCTION TO PROBLEM

I have data encoded in string in one DataFrame column:

  id             data
0  a   2;0;4208;1;790
1  b     2;0;768;1;47
2  c       2;0;92;1;6
3  d          1;0;341
4  e  3;0;1;2;6;4;132
5  f  3;0;1;1;6;3;492

Data represents count how many times some events happened in out system. We can have 256 different events (each has numerical id assigned from range 0-255). As usually we have only a few events happen in one measurement period is doesn't make sense to store all zeros. That's why data is encoded as follows: first number tells how many events happened during measurement period, then each pair contains event_id and counter.

For example:
"3;0;1;1;6;3;492" means:

3 events happened in measurement period
event with id=0 happened 1 time
event with id=1 happened 6 times
event with id=3 happened 492 time
other events didn't happen

I need to decode the data to separate columns. Expected result is DataFrame which looks like this:

  id  data_0  data_1  data_2  data_3  data_4
0  a  4208.0   790.0     0.0     0.0     0.0
1  b   768.0    47.0     0.0     0.0     0.0
2  c    92.0     6.0     0.0     0.0     0.0
3  d   341.0     0.0     0.0     0.0     0.0
4  e     1.0     0.0     6.0     0.0   132.0
5  f     1.0     6.0     0.0   492.0     0.0

QUESTION ITSELF

I came up with the following function to do it:

def split_data(data: pd.Series):
    tmp = data.str.split(';', expand=True).astype('Int32').fillna(-1)
    tmp = tmp.apply(
        lambda row: {'{0}_{1}'.format(data.name,row[i*2-1]): row[i*2] for i in range(1,row[0] 1)},
        axis='columns',
        result_type='expand').fillna(0)
    return tmp

df = pd.concat([df, split_data(df.pop('data'))], axis=1)

The problem is that I have millions of lines to process and it takes A LOT of time. As I don't have that much experience with pandas, I hope someone would be able to help me with more efficient way of performing this task.

CodePudding user response：

I'd avoid processing this in pandas, assuming you have the data in some other format I'd parse it into lists of dictionaries then load it into pandas.

import pandas as pd
from typing import Dict

data = {
    "a": "2;0;4208;1;790",
    "b": "2;0;768;1;47",
    "c": "2;0;92;1;6",
    "d": "1;0;341",
    "e": "3;0;1;2;6;4;132",
    "f": "3;0;1;1;6;3;492"
}

def get_event_counts(event_str: str, delim: str = ";") -> Dict[str, int]:
    """
    given an event string return a dictionary of events
    """
    EVENT_COUNT_INDEX = 0
    
    split_event = event_str.split(delim)
    event_count = int(split_event[EVENT_COUNT_INDEX])
    
    events = {
        split_event[index*2 1]: int(split_event[index*2 2]) for index in range(event_count - 1 // 2)
    }
    
    return events

data_records = [{"id": k, **get_event_counts(v)} for k,v in data.items()]

print(pd.DataFrame(data_records))

id     0      1    2      4      3
0  a  4208  790.0  NaN    NaN    NaN
1  b   768   47.0  NaN    NaN    NaN
2  c    92    6.0  NaN    NaN    NaN
3  d   341    NaN  NaN    NaN    NaN
4  e     1    NaN  6.0  132.0    NaN
5  f     1    6.0  NaN    NaN  492.0

If you're situated on your current df as the input, you could try this:

def process_starting_dataframe(starting_dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Create a new dataframe from original input with two columns "id" and "data
    """
    data_dict = starting_df.T.to_dict()
    data_records = [{"id": i['id'], **get_event_counts(i['data'])} for i in data_dict.values()]
    
    return pd.DataFrame(data_records)

CodePudding user response：

A much more efficient method is to construct dicts from your data. Do you observe how the alternate values in the split string are keys and values?

Then apply pd.Series and fillna(0) to get the dataframe with all required columns for the data.

Then you can concat.

Code:

df_data = df['data'].apply(
    lambda x:dict(zip(x.split(';')[1::2], x.split(';')[2::2]))).apply(pd.Series).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)
df = pd.concat([df.drop('data',axis=1), df_data], axis=1)

output:

  id data_0 data_1 data_2 data_4 data_3
0  a   4208    790      0      0      0
1  b    768     47      0      0      0
2  c     92      6      0      0      0
3  d    341      0      0      0      0
4  e      1      0      6    132      0
5  f      1      6      0      0    492

If you need sorted columns you can just do:

df = df[sorted(df.columns)]

CodePudding user response：

Code

pairs = df['data'].str.extractall(r'(?<!^)(\d );(\d )')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0)

df[['id']].join(pairs.add_prefix('data_'))

Explained

Extract all pairs using a regex pattern

         0     1
  match         
0 0      0  4208
  1      1   790
1 0      0   768
  1      1    47
2 0      0    92
  1      1     6
3 0      0   341
4 0      0     1
  1      2     6
  2      4   132
5 0      0     1
  1      1     6
  2      3   492

Pivot the pairs to reshape into desired format

0     0    1  2    3    4
0  4208  790  0    0    0
1   768   47  0    0    0
2    92    6  0    0    0
3   341    0  0    0    0
4     1    0  6    0  132
5     1    6  0  492    0

Join the reshaped pairs dataframe back with id column

  id data_0 data_1 data_2 data_3 data_4
0  a   4208    790      0      0      0
1  b    768     47      0      0      0
2  c     92      6      0      0      0
3  d    341      0      0      0      0
4  e      1      0      6      0    132
5  f      1      6      0    492      0