Split text data into row records using Python pandas

Time:12-16

I am new to Python and am trying to split fixed-width text data into columns and rows for an Excel file. Suppose I have 100 records, and each record needs to be split by character position: characters 1-7 are the first column, 8-8 the second, 9-10 the third, 11-18 the fourth, 19-24 the fifth, 25-124 the sixth, and 125-1000 the seventh. The example records below are in text.txt. I want to convert them into an Excel file based on the character positions above. Any help would be appreciated.

Example text format:

9999999M0210012021454   Copyright 2021 National Council for Prescription Drug Programs, All Rights Reserved                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
00301ABS LLC SO CAL AND IMW             P O BOX 742382                                                                                                LOS ANGELES                   CA9007423822083953954     6232823834          820184434      KATHY GIANNAKOPOULOS          MGR, 3RD PARTY [email protected]               MICHAEL MOLLSEN               DIRECTOR, MANAGED CARE        [email protected]                     JESSICA WILTS                 SR MGR, MANAGED CARE          [email protected]                      MARC ALLGOOD                  PHARMACY SYSTEMS DIRECTOR     [email protected]                       JUDEE OLIMPO                  MANAGER, 3RD PARTY AUDIT & [email protected]                       0003640503199600000000                                                                                                                                                                                           
00801CVS PHARMACY INC                   1 CVS DRIVE                                            BOX 1075                                               WOONSOCKET                    RI02895    4017651500     4017707108                         SUSAN COLBERT                 DIRECTOR, PAYER RELATIONS     [email protected]                                                                                                                                   ANTHONY GRATTO                MANAGER, PAYER RELATIONS      [email protected]                                                                                                                                                                                                                                                0000340101200100000000                                                                                                                                                                                           
01101THE BARTELL DRUG COMPANY           4025 DELRIDGE WAY SW STE 400                                                                                  SEATTLE                       WA9810612737179755937     7179758659          910138195      JENNIFER ZOREK                DIRECTOR                      [email protected]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        0002571218202000000000                                                                                                                                                                                           

The records above are examples of the data to be split into rows.

Example output format:

                   0   1  2  3           4   5   6
      Headers  1. 9999 m  01 10012021
      Rows     2. ------below is the records-------------
               3. ---------------------------------------

CodePudding user response:

Use the .str accessor to slice each string by character position:

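# Map each new column name to its [start, end) character positions (0-based).
column_splits = {'first': [0, 7], 'second': [7, 10]}

# df is your DataFrame; 'your_column' holds the raw text lines.
for column, (start, end) in column_splits.items():
    df[column] = df['your_column'].str[start:end]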
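As a rough sketch of how this could look with the column positions from the question, the 1-based inclusive ranges (1-7, 8-8, and so on) are converted to 0-based half-open slices. The column names col1 to col7, the column name "raw", and the sample line are placeholders, not part of the original data layout.

```python
import pandas as pd

# Positions from the question, converted from 1-based inclusive ranges
# to 0-based half-open slices. Column names are placeholders.
column_splits = {
    "col1": (0, 7),       # characters 1-7
    "col2": (7, 8),       # character 8
    "col3": (8, 10),      # characters 9-10
    "col4": (10, 18),     # characters 11-18
    "col5": (18, 24),     # characters 19-24
    "col6": (24, 124),    # characters 25-124
    "col7": (124, 1000),  # characters 125-1000
}

# Shortened sample line; real lines would be read from text.txt.
lines = ["9999999M0210012021454   Copyright 2021 National Council"]
df = pd.DataFrame({"raw": lines})

for column, (start, end) in column_splits.items():
    # Slice by position, then trim the padding spaces of each field.
    df[column] = df["raw"].str[start:end].str.strip()

print(df.drop(columns="raw"))
```

To use the real file, the lines could be loaded with open("text.txt").read().splitlines(), and the result written out with df.to_excel("output.xlsx", index=False), which needs an Excel writer engine such as openpyxl installed. pandas.read_fwf with a colspecs list of the same (start, end) pairs is another option for reading such a file directly.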

CodePudding user response:

You can combine itertools.tee and itertools.zip_longest to slice each string at a list of indices.

Function to split:

from itertools import tee, zip_longest

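def split_by_index(s):
    # Cut points: each adjacent pair (i, j) becomes the slice s[i:j].
    indices = [0, 7, 10, 14, 20]
    start, end = tee(indices)
    next(end)  # advance the second iterator so the pairs are (0, 7), (7, 10), ...
    # zip_longest pads the last pair with None, so the final slice runs to the end.
    return " ".join([s[i:j] for i, j in zip_longest(start, end)])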

Your data:

import pandas as pd

df = pd.DataFrame()
df["sentence"] = ["animals120 redlivinginjungle",
                  "animals140 redlivinginjungle",
                  "animals160 redlivinginjungle"]


    sentence
0   animals120 redlivinginjungle
1   animals140 redlivinginjungle
2   animals160 redlivinginjungle

Then apply function to create new dataframe:

new_df = df["sentence"].apply(split_by_index).str.split(expand=True)

Output

print(new_df)

    0       1   2   3       4
0   animals 120 red living  injungle
1   animals 140 red living  injungle
2   animals 160 red living  injungle
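One caveat: joining the pieces with spaces and then splitting on whitespace only works when no field contains a space itself, which is not the case for the names and addresses in the question's data. A possible variant, sketched here with the hypothetical helper name split_fields and a made-up sample row, returns the pieces as a list and builds the DataFrame from that instead.

```python
from itertools import tee, zip_longest

import pandas as pd


def split_fields(s, indices=(0, 7, 10, 14, 20)):
    """Return the slices of s between adjacent cut points as a list."""
    start, end = tee(indices)
    next(end)  # offset the second iterator by one position
    # The last pair is (20, None), so the final slice runs to the end of s.
    return [s[i:j] for i, j in zip_longest(start, end)]


# Sample row whose last field contains a space.
df = pd.DataFrame({"sentence": ["animals120 redlivingin jungle"]})

# One list per row becomes one DataFrame row; no whitespace re-splitting needed.
new_df = pd.DataFrame(df["sentence"].apply(split_fields).tolist())

# Trim padding spaces inside each field.
new_df = new_df.apply(lambda col: col.str.strip())

print(new_df)
```

Because the fields are never re-joined, a value such as "in jungle" stays in a single column.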