I have a list that is built like this:
mylist = ['2003 00045', 'John', 'Closed', '4/10/21', '19675-B', '2001 00065',
'Kate', 'Approved', '2005 00054', 'True']
I am trying to build a dataframe where the first column will contain all of the identifiers in the list (e.g., '2003 00045', '2001 00065', '2005 00054'), and the second row will contain all of the list elements that come after it. The example data frame would like this:
col1 col2
2003 00045 John, Closed, 4/10/21, 19675-B
2001 00065 Kate, Approved
2005 00054 True
I've been able to define the regex pattern to pull out the identifier from the list, but haven't been able to figure out how to pull out all the elements following the identifier. The list is unstructured so there is no set amount of elements that come after an identifier. I experimented with using a dict
but the identifiers are not unique so if I treated them as keys, the code would overwrite values as it goes through the list.
CodePudding user response:
Try:
import re
r = re.compile(r"\d{4} \d{5}")
data, id_ = {}, None
for v in mylist:
if (m := r.match(v)):
id_ = m.group(0)
else:
data.setdefault(id_, []).append(v)
df = pd.DataFrame(
[{"col1": k, "col2": ", ".join(map(str, v))} for k, v in data.items()]
)
print(df.to_markdown())
Prints:
col1 | col2 | |
---|---|---|
0 | 2003 00045 | John, Closed, 4/10/21, 19675-B |
1 | 2001 00065 | Kate, Approved |
2 | 2005 00054 | True |
Or:
x = pd.DataFrame({"col1": mylist})
x = x.groupby(x["col1"].str.match(r"\d{4} \d{5}").cumsum()).agg(list)
df = pd.DataFrame(
{"col1": x["col1"].str[0].values, "col2": x["col1"].str[1:].values}
)
print(df)
Prints:
col1 col2
0 2003 00045 [John, Closed, 4/10/21, 19675-B]
1 2001 00065 [Kate, Approved]
2 2005 00054 [True]