Extract elements from a list into a dataframe based on a regex string-CodePudding

I have a list that is built like this:

mylist = ['2003 00045', 'John', 'Closed', '4/10/21', '19675-B', '2001 00065',
'Kate', 'Approved', '2005 00054', 'True']

I am trying to build a dataframe where the first column will contain all of the identifiers in the list (e.g., '2003 00045', '2001 00065', '2005 00054'), and the second row will contain all of the list elements that come after it. The example data frame would like this:

col1        col2
2003 00045  John, Closed, 4/10/21, 19675-B
2001 00065  Kate, Approved
2005 00054  True

I've been able to define the regex pattern to pull out the identifier from the list, but haven't been able to figure out how to pull out all the elements following the identifier. The list is unstructured so there is no set amount of elements that come after an identifier. I experimented with using a dict but the identifiers are not unique so if I treated them as keys, the code would overwrite values as it goes through the list.

CodePudding user response：

Try:

import re

r = re.compile(r"\d{4} \d{5}")

data, id_ = {}, None
for v in mylist:
    if (m := r.match(v)):
        id_ = m.group(0)
    else:
        data.setdefault(id_, []).append(v)

df = pd.DataFrame(
    [{"col1": k, "col2": ", ".join(map(str, v))} for k, v in data.items()]
)
print(df.to_markdown())

Prints:

	col1	col2
0	2003 00045	John, Closed, 4/10/21, 19675-B
1	2001 00065	Kate, Approved
2	2005 00054	True

Or:

x = pd.DataFrame({"col1": mylist})
x = x.groupby(x["col1"].str.match(r"\d{4} \d{5}").cumsum()).agg(list)
df = pd.DataFrame(
    {"col1": x["col1"].str[0].values, "col2": x["col1"].str[1:].values}
)
print(df)

Prints:

         col1                              col2
0  2003 00045  [John, Closed, 4/10/21, 19675-B]
1  2001 00065                  [Kate, Approved]
2  2005 00054                            [True]