I need to loop through an array like the one below in Python and output a DataFrame, extracting the text between the delimiters into named columns. For example, for a customer with 4 instances the array is:
["13900979_UKI_UK-12/01/2022-03/09/2021", "14703087_UKI_UK-12/01/2022-10/01/2022", "14929368_UKI_UK-27/01/2022-25/01/2022", "14991771_UKI_UK-01/02/2022-30/01/2022"]
and the resulting columns should be:
Ref Market MarketCode ReceivedDate SentDate
13900979 UKI UK 12/01/2022 03/09/2021
14703087 UKI UK 12/01/2022 10/01/2022
14929368 UKI UK 27/01/2022 25/01/2022
14991771 UKI UK 01/02/2022 30/01/2022
How do I do this dynamically so that it handles arrays of varying length? The example above is for a customer with 4 instances of the item; others have 1, 2, or more.
CodePudding user response:
You can load the strings into a Series and split on either delimiter with a regular expression:
import pandas as pd
lst = ["13900979_UKI_UK-12/01/2022-03/09/2021",
"14703087_UKI_UK-12/01/2022-10/01/2022",
"14929368_UKI_UK-27/01/2022-25/01/2022",
"14991771_UKI_UK-01/02/2022-30/01/2022"]
cols = ['Ref', 'Market', 'MarketCode', 'ReceivedDate', 'SentDate']
df = pd.Series(lst).str.split('[_-]', expand=True)
df.columns = cols
Output:
Ref Market MarketCode ReceivedDate SentDate
0 13900979 UKI UK 12/01/2022 03/09/2021
1 14703087 UKI UK 12/01/2022 10/01/2022
2 14929368 UKI UK 27/01/2022 25/01/2022
3 14991771 UKI UK 01/02/2022 30/01/2022
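Since the question asks about lists of varying length, here is a minimal sketch that wraps the same split in a helper and calls it on a one-item and a four-item list. The function name `parse_instances` and the constant `COLS` are hypothetical, chosen for illustration; it assumes every string has exactly five fields and that the list is not empty.

```python
import pandas as pd

# Column names taken from the question
COLS = ["Ref", "Market", "MarketCode", "ReceivedDate", "SentDate"]

def parse_instances(items):
    """Split each string on '_' or '-' and return one row per item."""
    # A multi-character pattern such as '[_-]' is treated as a regex
    df = pd.Series(items).str.split("[_-]", expand=True)
    df.columns = COLS  # assumes exactly five fields per string
    return df

one = parse_instances(["13900979_UKI_UK-12/01/2022-03/09/2021"])
four = parse_instances([
    "13900979_UKI_UK-12/01/2022-03/09/2021",
    "14703087_UKI_UK-12/01/2022-10/01/2022",
    "14929368_UKI_UK-27/01/2022-25/01/2022",
    "14991771_UKI_UK-01/02/2022-30/01/2022",
])
print(one.shape, four.shape)
```

Nothing in the split depends on the number of items, so the same call covers customers with 1, 2, or more instances.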
CodePudding user response:
Try a list comprehension:
import pandas as pd
lst = ["13900979_UKI_UK-12/01/2022-03/09/2021",
"14703087_UKI_UK-12/01/2022-10/01/2022",
"14929368_UKI_UK-27/01/2022-25/01/2022",
"14991771_UKI_UK-01/02/2022-30/01/2022"]
df = pd.DataFrame(data=[s.replace("-", "_").split("_") for s in lst],
                  columns=["Ref", "Market", "MarketCode", "ReceivedDate", "SentDate"])
>>> df
Ref Market MarketCode ReceivedDate SentDate
0 13900979 UKI UK 12/01/2022 03/09/2021
1 14703087 UKI UK 12/01/2022 10/01/2022
2 14929368 UKI UK 27/01/2022 25/01/2022
3 14991771 UKI UK 01/02/2022 30/01/2022
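A possible variation on the list comprehension, offered as a sketch rather than a drop-in replacement: `re.split` handles both delimiters in one pass, and a length check drops strings that do not produce five fields. The third list entry is a made-up malformed value added only to show the filter, and silently dropping bad rows is an assumption about what you want.

```python
import re
import pandas as pd

COLS = ["Ref", "Market", "MarketCode", "ReceivedDate", "SentDate"]

lst = ["13900979_UKI_UK-12/01/2022-03/09/2021",
       "14703087_UKI_UK-12/01/2022-10/01/2022",
       "not-a-valid_record"]  # hypothetical malformed entry

# Split every string on either '_' or '-'
rows = [re.split(r"[_-]", s) for s in lst]

# Keep only rows with the expected number of fields
good = [r for r in rows if len(r) == len(COLS)]

df = pd.DataFrame(good, columns=COLS)
print(df)
```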
CodePudding user response:
Use StringIO with read_csv. This leverages read_csv's date parsing as well.
import pandas as pd
from io import StringIO
data = [
"13900979_UKI_UK-12/01/2022-03/09/2021",
"14703087_UKI_UK-12/01/2022-10/01/2022",
"14929368_UKI_UK-27/01/2022-25/01/2022",
"14991771_UKI_UK-01/02/2022-30/01/2022"
]
df = pd.read_csv(
    StringIO('\n'.join(data)),
    delimiter='[-_]',    # Use `engine='python'` because
    engine='python',     # our delimiter is a regex
    names=['Ref', 'Market', 'MarketCode', 'ReceivedDate', 'SentDate'],
    parse_dates=[3, 4],  # These are the column positions of the date columns
    dayfirst=True        # Dates are DD/MM/YYYY; otherwise 12/01/2022 parses as 1 December
)
df
Ref Market MarketCode ReceivedDate SentDate
0 13900979 UKI UK 2022-01-12 2021-09-03
1 14703087 UKI UK 2022-01-12 2022-01-10
2 14929368 UKI UK 2022-01-27 2022-01-25
3 14991771 UKI UK 2022-02-01 2022-01-30
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ref 4 non-null int64
1 Market 4 non-null object
2 MarketCode 4 non-null object
3 ReceivedDate 4 non-null datetime64[ns]
4 SentDate 4 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 288.0 bytes
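If you use one of the split-based approaches instead, the date columns stay as strings. A minimal sketch of converting them afterwards with `pd.to_datetime`, assuming the dates are always day-first (`DD/MM/YYYY`), which values such as 27/01/2022 in the sample suggest; the small frame below is a cut-down stand-in for the real one.

```python
import pandas as pd

# Stand-in for a frame produced by one of the split-based answers
df = pd.DataFrame({
    "ReceivedDate": ["12/01/2022", "01/02/2022"],
    "SentDate": ["03/09/2021", "30/01/2022"],
})

# An explicit format avoids any day/month guessing
for col in ["ReceivedDate", "SentDate"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```

With an explicit format, a value that does not fit raises an error instead of being parsed with day and month swapped.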
CodePudding user response:
Just for kicks, here's an alternate solution using regex -- assuming the pattern of your data is consistent:
import re
import pandas as pd
data = ["13900979_UKI_UK-12/01/2022-03/09/2021",
"14703087_UKI_UK-12/01/2022-10/01/2022",
"14929368_UKI_UK-27/01/2022-25/01/2022",
"14991771_UKI_UK-01/02/2022-30/01/2022"]
pattern = re.compile(r"(?P<Ref>\d{8})_(?P<Market>\w{3})_(?P<MarketCode>\w{2})-(?P<ReceivedDate>\d{2}\/\d{2}\/\d{4})-(?P<SentDate>\d{2}\/\d{2}\/\d{4})")
pd.DataFrame([pattern.match(line).groupdict() for line in data])
Output:
Ref Market MarketCode ReceivedDate SentDate
0 13900979 UKI UK 12/01/2022 03/09/2021
1 14703087 UKI UK 12/01/2022 10/01/2022
2 14929368 UKI UK 27/01/2022 25/01/2022
3 14991771 UKI UK 01/02/2022 30/01/2022
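The same named-group pattern can also be handed to pandas directly through `Series.str.extract`, which turns each named group into a column. This is a sketch under the same assumption that the data follows a consistent pattern. One difference from `pattern.match(...).groupdict()`: a string that does not match produces a row of NaN rather than an exception. The `"bad"` entry is a hypothetical value added to show that.

```python
import pandas as pd

data = ["13900979_UKI_UK-12/01/2022-03/09/2021",
        "14703087_UKI_UK-12/01/2022-10/01/2022",
        "14929368_UKI_UK-27/01/2022-25/01/2022",
        "14991771_UKI_UK-01/02/2022-30/01/2022"]

# Named groups become the column names of the result
pattern = (r"(?P<Ref>\d{8})_(?P<Market>\w{3})_(?P<MarketCode>\w{2})"
           r"-(?P<ReceivedDate>\d{2}/\d{2}/\d{4})"
           r"-(?P<SentDate>\d{2}/\d{2}/\d{4})")

df = pd.Series(data).str.extract(pattern)

# A non-matching string yields a row of NaN instead of raising
df_with_bad = pd.Series(data + ["bad"]).str.extract(pattern)
print(df)
```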