Splitting list array by delimiters

Time:03-12

I need to loop through an array like the one below in Python and output a DataFrame, with the text between the delimiters extracted into named columns. For a customer with 4 instances, for example, the array below should produce the table that follows it.

["13900979_UKI_UK-12/01/2022-03/09/2021", "14703087_UKI_UK-12/01/2022-10/01/2022", "14929368_UKI_UK-27/01/2022-25/01/2022", "14991771_UKI_UK-01/02/2022-30/01/2022"]

  Ref      Market   MarketCode  ReceivedDate  SentDate  
  13900979  UKI      UK           12/01/2022    03/09/2021
  14703087  UKI      UK           12/01/2022    10/01/2022        
  14929368  UKI      UK           27/01/2022    25/01/2022
  14991771  UKI      UK           01/02/2022    30/01/2022

How do I do this dynamically so that it handles arrays of varying length? The example above is for a customer who has had 4 instances of the item; others have 1, 2, etc.

CodePudding user response:

You can load the strings into a Series and split on both delimiters with a regex:

import pandas as pd

lst = ["13900979_UKI_UK-12/01/2022-03/09/2021", 
       "14703087_UKI_UK-12/01/2022-10/01/2022", 
       "14929368_UKI_UK-27/01/2022-25/01/2022", 
       "14991771_UKI_UK-01/02/2022-30/01/2022"]
cols = ['Ref', 'Market', 'MarketCode', 'ReceivedDate', 'SentDate']

# '[_-]' is a regex character class: split on either '_' or '-'
df = pd.Series(lst).str.split('[_-]', expand=True)
df.columns = cols

output:

        Ref Market MarketCode ReceivedDate    SentDate
0  13900979    UKI         UK   12/01/2022  03/09/2021
1  14703087    UKI         UK   12/01/2022  10/01/2022
2  14929368    UKI         UK   27/01/2022  25/01/2022
3  14991771    UKI         UK   01/02/2022  30/01/2022
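Note that the split leaves every column as a string. If you need real dates and a numeric Ref afterwards, something like the following should work. This is only a sketch, and it assumes the dates are day-first (dd/mm/yyyy), which values such as 27/01/2022 suggest:

```python
import pandas as pd

lst = ["13900979_UKI_UK-12/01/2022-03/09/2021",
       "14703087_UKI_UK-12/01/2022-10/01/2022",
       "14929368_UKI_UK-27/01/2022-25/01/2022",
       "14991771_UKI_UK-01/02/2022-30/01/2022"]
cols = ['Ref', 'Market', 'MarketCode', 'ReceivedDate', 'SentDate']

df = pd.Series(lst).str.split(r'[_-]', expand=True)
df.columns = cols

# Assumption: dates are dd/mm/yyyy, so give the format explicitly
# instead of letting pandas guess the day/month order.
for col in ['ReceivedDate', 'SentDate']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')

# Ref is still a string after the split; convert it if you need a number.
df['Ref'] = df['Ref'].astype('int64')
```

Giving the format explicitly avoids 12/01/2022 being read as 1 December.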

CodePudding user response:

Try with list comprehension:

import pandas as pd

lst = ["13900979_UKI_UK-12/01/2022-03/09/2021", 
       "14703087_UKI_UK-12/01/2022-10/01/2022", 
       "14929368_UKI_UK-27/01/2022-25/01/2022", 
       "14991771_UKI_UK-01/02/2022-30/01/2022"]

# Turn every '-' into '_' so that a single split handles both delimiters
df = pd.DataFrame(data=[s.replace("-","_").split("_") for s in lst], 
                  columns=["Ref", "Market", "MarketCode", "ReceivedDate", "SentDate"])

>>> df
        Ref Market MarketCode ReceivedDate    SentDate
0  13900979    UKI         UK   12/01/2022  03/09/2021
1  14703087    UKI         UK   12/01/2022  10/01/2022
2  14929368    UKI         UK   27/01/2022  25/01/2022
3  14991771    UKI         UK   01/02/2022  30/01/2022
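On the "varying length" part of the question: the comprehension already runs once per element, so nothing changes for 1, 2 or 50 instances. A rough sketch that wraps it in a function (the name `to_frame` is made up for illustration, and `re.split` is swapped in for the replace/split pair):

```python
import re

import pandas as pd

COLUMNS = ["Ref", "Market", "MarketCode", "ReceivedDate", "SentDate"]


def to_frame(items):
    """Hypothetical helper: one row per string, whatever the list length."""
    # Split each string on either delimiter; assumes every string has 5 fields.
    rows = [re.split(r"[_-]", item) for item in items]
    return pd.DataFrame(rows, columns=COLUMNS)


one = to_frame(["13900979_UKI_UK-12/01/2022-03/09/2021"])
two = to_frame(["13900979_UKI_UK-12/01/2022-03/09/2021",
                "14703087_UKI_UK-12/01/2022-10/01/2022"])
empty = to_frame([])  # a customer with no instances gives an empty frame
```

A string with more or fewer than 5 fields would make the DataFrame constructor fail, so this only fits data that follows the pattern in the question.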

CodePudding user response:

Use StringIO with read_csv
This leverages read_csv's date parsing as well.

import pandas as pd
from io import StringIO

data = [
    "13900979_UKI_UK-12/01/2022-03/09/2021",
    "14703087_UKI_UK-12/01/2022-10/01/2022",
    "14929368_UKI_UK-27/01/2022-25/01/2022",
    "14991771_UKI_UK-01/02/2022-30/01/2022"
]

df = pd.read_csv(
    StringIO('\n'.join(data)),
    delimiter='[-_]',  # Use `engine='python'` because
    engine='python',   # our delimiter is regex
    names=['Ref', 'Market', 'MarketCode', 'ReceivedDate', 'SentDate'],
    parse_dates=[3, 4],  # These are the column positions of date columns
    dayfirst=True        # The dates are dd/mm/yyyy (e.g. 27/01/2022)
)

df

        Ref Market MarketCode ReceivedDate   SentDate
0  13900979    UKI         UK   2022-01-12 2021-09-03
1  14703087    UKI         UK   2022-01-12 2022-01-10
2  14929368    UKI         UK   2022-01-27 2022-01-25
3  14991771    UKI         UK   2022-02-01 2022-01-30

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Ref           4 non-null      int64         
 1   Market        4 non-null      object        
 2   MarketCode    4 non-null      object        
 3   ReceivedDate  4 non-null      datetime64[ns]
 4   SentDate      4 non-null      datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 288.0+ bytes

CodePudding user response:

Just for kicks, here's an alternate solution using regex -- assuming the pattern of your data is consistent:

import re

import pandas as pd

data = ["13900979_UKI_UK-12/01/2022-03/09/2021", 
        "14703087_UKI_UK-12/01/2022-10/01/2022", 
        "14929368_UKI_UK-27/01/2022-25/01/2022", 
        "14991771_UKI_UK-01/02/2022-30/01/2022"]

pattern = re.compile(r"(?P<Ref>\d{8})_(?P<Market>\w{3})_(?P<MarketCode>\w{2})-(?P<ReceivedDate>\d{2}\/\d{2}\/\d{4})-(?P<SentDate>\d{2}\/\d{2}\/\d{4})")

pd.DataFrame([pattern.match(line).groupdict() for line in data])

Output:

        Ref Market MarketCode ReceivedDate    SentDate
0  13900979    UKI         UK   12/01/2022  03/09/2021
1  14703087    UKI         UK   12/01/2022  10/01/2022
2  14929368    UKI         UK   27/01/2022  25/01/2022
3  14991771    UKI         UK   01/02/2022  30/01/2022
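If the pattern is not always consistent, `pattern.match(line)` returns None for a bad line and `.groupdict()` then raises. One possible variation, sketched here with a made-up malformed entry, is pandas' own `Series.str.extract`, which takes the same named groups and gives NaN for lines that don't match:

```python
import pandas as pd

data = ["13900979_UKI_UK-12/01/2022-03/09/2021",
        "14703087_UKI_UK-12/01/2022-10/01/2022",
        "14929368_UKI_UK-27/01/2022-25/01/2022",
        "14991771_UKI_UK-01/02/2022-30/01/2022",
        "not a valid record"]  # hypothetical bad entry, not from the question

# Same named groups as above; the group names become the column names.
pattern = (r"(?P<Ref>\d{8})_(?P<Market>\w{3})_(?P<MarketCode>\w{2})"
           r"-(?P<ReceivedDate>\d{2}/\d{2}/\d{4})"
           r"-(?P<SentDate>\d{2}/\d{2}/\d{4})")

extracted = pd.Series(data).str.extract(pattern)

# Rows that did not match are all-NaN; drop them (or inspect them first).
clean = extracted.dropna()
```

Whether to drop the bad rows or raise on them depends on how much you trust the source data.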