How to split numbers in a string-type column?

I have a DataFrame with a column, df['EVENT_DTL'], whose values look like this:

1. 변사자 정보 : Kim_******-1****** 2. 발견일시 : 2013년05월18일 13:00 3. 발견장소 : 1) 수사기록 상 주소 주민등록상 주소 : 실거주지 주소 : 시도(발견)장소 주소 :  2) 실제 조사원이 입력한 주소주민등록상 주소 : 실거주지 주소 : 시도(발견)장소 주소 : 조치(사유 포함) : 4. 발견장소 코딩사유 : 자택 / 5. 방법/수단 : 목매달기6. 발견경위 : 2013.5.18 13:00경 New York, in his apartment 7. 주원인 코딩사유 : Family reason 8. 기본배경정보 : 원단도매업 /     자녀 및 손주    과 거주 결혼상태_별거 9. 사회경제적상태 : Strong depression 10. 성격 : 알수없음 11. 대인관계 : 대인관계문제_모름,친구 관련 12. 정서상태 : 우울한 기분 관찰됨  13. 경찰 최종자살판단유무 및 내용 : 자살_가족관계문제_      목매달기     14. 코로나와의 관련성 : 없음_2020년 이전 사망 15. 코로나의 자살영향 및 주요인 : 없음_2020년 이전 사망

NOTE: The above is a single line, not separate lines. It only appears wrapped here for display.

I want to split the text at the markers 1. 2. 3. … 15. and insert "\n" before each number.

Desired output looks like this:

\n1. 변사자 정보 : Kim_******-1******
\n2. 발견일시 : 2013년05월18일 13:00
\n3. 발견장소 : 
\n1) 수사기록 상 주소 
주민등록상 주소 : 
실거주지 주소 : 
시도(발견)장소 주소 :  
\n2) 실제 조사원이 입력한 주소
주민등록상 주소 : 
실거주지 주소 : 
시도(발견)장소 주소 : 
조치(사유 포함) :
\n4. 발견장소 코딩사유 : 자택 / 
\n5. 방법/수단 : 목매달기
\n6. 발견경위 : 2013.5.18 13:00경 New York, in his apartment
\n7. 주원인 코딩사유 : Family reason
\n8. 기본배경정보 : 원단도매업 /     자녀 및 손주    과 거주 결혼상태_별거
\n9. 사회경제적상태 : Strong depression
\n10. 성격 : 알수없음
\n11. 대인관계 : 대인관계문제_모름,친구 관련
\n12. 정서상태 : 우울한 기분 관찰됨   
\n13. 경찰 최종자살판단유무 및 내용 : 자살_가족관계문제_      목매달기     
\n14. 코로나와의 관련성 : 없음_2020년 이전 사망
\n15. 코로나의 자살영향 및 주요인 : 없음_2020년 이전 사망

I tried this (note: some rows already start with \n):

import re

df3 = df.loc[~df.EVENT_DTL.str.contains('\n',na=False),'EVENT_DTL']
re.split('(?<=1.|(?<=2.||(?<=3.|(?<=1\)|(?<=2)|(?<=4.|(?<=5.|(?<=6.|(?<=7.|(?<=8.|(?<=9.|(?<=10.|(?<=11.|(?<=12.|(?<=13.|(?<=14.|(?<=15.',df3)

but it raises the following error (sorry for the long output):

error                                     Traceback (most recent call last)
<ipython-input-20-3b8b06001e11> in <module>
      2 
      3 df3 = df.loc[~df.EVENT_DTL.str.contains('\n',na=False),'EVENT_DTL']
----> 4 re.split('(?<=1.|(?<=2.||(?<=3.|(?<=1\)|(?<=2)|(?<=4.|(?<=5.|(?<=6.|(?<=7.|(?<=8.|(?<=9.|(?<=10.|(?<=11.|(?<=12.|(?<=13.|(?<=14.|(?<=15.',df3)

35 frames
/usr/lib/python3.7/re.py in split(pattern, string, maxsplit, flags)
    213     and the remainder of the string is returned as the final element
    214     of the list."""
--> 215     return _compile(pattern, flags).split(string, maxsplit)
    216 
    217 def findall(pattern, string, flags=0):

/usr/lib/python3.7/re.py in _compile(pattern, flags)
    286     if not sre_compile.isstring(pattern):
    287         raise TypeError("first argument must be string or compiled pattern")
--> 288     p = sre_compile.compile(pattern, flags)
    289     if not (flags & DEBUG):
    290         if len(_cache) >= _MAXCACHE:

/usr/lib/python3.7/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/usr/lib/python3.7/sre_parse.py in parse(str, flags, pattern)
    922 
    923     try:
--> 924         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    925     except Verbose:
    926         # the VERBOSE flag was switched on inside the pattern.  to be

/usr/lib/python3.7/sre_parse.py in _parse_sub(source, state, verbose, nested)
    418     while True:
    419         itemsappend(_parse(source, state, verbose, nested + 1,
--> 420                            not nested and not items))
    421         if not sourcematch("|"):
    422             break

/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
    728                         if lookbehindgroups is None:
    729                             state.lookbehindgroups = state.groups
--> 730                     p = _parse_sub(source, state, verbose, nested + 1)
    731                     if dir < 0:
    732                         if lookbehindgroups is None:

[... the two frames above repeat 14 more times ...]

/usr/lib/python3.7/sre_parse.py in _parse_sub(source, state, verbose, nested)
    418     while True:
    419         itemsappend(_parse(source, state, verbose, nested + 1,
--> 420                            not nested and not items))
    421         if not sourcematch("|"):
    422             break

/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
    734                     if not sourcematch(")"):
    735                         raise source.error("missing ), unterminated subpattern",
--> 736                                            source.tell() - start)
    737                     if char == "=":
    738                         subpatternappend((ASSERT, (dir, p)))

error: missing ), unterminated subpattern at position 119

CodePudding user response：

df['EVENT_DTL'] = "\n" + df['EVENT_DTL'].astype(str)

CodePudding user response：

I think you are looking for any place where a line starts with a digit, and you want to add a special string there, before the digit.

It isn't clear to me whether you want to add a real newline there, or a backslash followed by an n.

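This will add a newline. Because df3 is a Series, the substitution goes through its .str accessor (re.sub itself only accepts a single string):

result = df3.str.replace(r"^(\d)", r"\n\1", regex=True, flags=re.MULTILINE)
print(result)

This will add "\n" as a literal two-character string:

result = df3.str.replace(r"^(\d)", r"\\n\1", regex=True, flags=re.MULTILINE)
print(result)

This works by matching the start of a line (^, which with re.MULTILINE also matches right after every newline) followed by any digit (\d), and then substituting "\n" followed by the originally matched digit (\1, the first captured group). If a value is a single line, as in the question's sample, ^ matches only at its very beginning, so only one "\n" is added.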
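To make the role of ^, re.MULTILINE, and the replacement string concrete, here is a minimal sketch on a plain string. The sample text is a made-up stand-in, not data from the question.

```python
import re

# Hypothetical sample: three short lines, each starting with a digit.
text = "1. first\n2. second\n3. third"

# r"\n\1" inserts a real newline before the captured digit on every line.
with_newline = re.sub(r"^(\d)", r"\n\1", text, flags=re.MULTILINE)

# r"\\n\1" inserts a literal backslash followed by the letter n instead.
with_literal = re.sub(r"^(\d)", r"\\n\1", text, flags=re.MULTILINE)

# Without re.MULTILINE, ^ matches only at the very start of the string.
start_only = re.sub(r"^(\d)", r"\n\1", text)

print(repr(with_newline))
print(repr(with_literal))
print(repr(start_only))
```

The third call shows why this approach alone does not split a value that is stored as one long line: there is only one line start to match.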

CodePudding user response：

You can use replace in pandas with regex=True:

df['EVENT_DTL'].replace(r"(\d+[\.|\)] )", r"\n\1", regex=True)

The regex matches any subsequence that starts with a number (\d+), is followed by either a . or a ) ([\.|\)]), and ends with a space. Each match is replaced by "\n" followed by the match itself (see capture groups). Note that replace returns a new Series, so assign the result if you want to keep it.

A more detailed explanation for the regex can be found here: https://regex101.com/r/2peTg4/1
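As a rough illustration of the same idea, the sketch below runs the replacement on a shortened, made-up stand-in for one EVENT_DTL value. It uses Series.str.replace and the simplified character class [.)], both of which are my own choices rather than the exact code above (inside [...] a | is a literal character, so it can be dropped).

```python
import pandas as pd

# Hypothetical, shortened stand-in for one EVENT_DTL value (a single line).
df = pd.DataFrame(
    {"EVENT_DTL": ["1. name : Kim 2. found : 2013.5.18 13:00 3. place : 1) home 10. note : none"]}
)

# \d+  = one or more digits
# [.)] = a dot or a closing parenthesis
# " "  = the space that follows a marker, which keeps "2013.5.18" from matching
marked = df["EVENT_DTL"].str.replace(r"(\d+[.)] )", r"\n\1", regex=True)

print(marked.iloc[0])

# Splitting on the inserted newlines gives one piece per marker;
# the first piece is empty because the value itself starts with a marker.
parts = marked.iloc[0].split("\n")
```

The date 2013.5.18 and the time 13:00 are left alone because neither is a number followed by ". " or ") ".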

Result of applying the regex and splitting on "\n", i.e.:

df['EVENT_DTL'].replace(r"(\d+[\.|\)] )", r"\n\1", regex=True).str.split("\n").explode()

1                       1. 변사자 정보 : Kim_******-1****** 
2                          2. 발견일시 : 2013년05월18일 13:00 
3                                            3. 발견장소 : 
4     1) 수사기록 상 주소 주민등록상 주소 : 실거주지 주소 : 시도(발견)장소 주소 :  
5     2) 실제 조사원이 입력한 주소주민등록상 주소 : 실거주지 주소 : 시도(발견)장소...
6                                  4. 발견장소 코딩사유 : 자택 / 
7                                       5. 방법/수단 : 목매달기
8     6. 발견경위 : 2013.5.18 13:00경 New York, in his ap...
9                          7. 주원인 코딩사유 : Family reason 
10     8. 기본배경정보 : 원단도매업 /     자녀 및 손주    과 거주 결혼상태_별거 
11                      9. 사회경제적상태 : Strong depression 
12                                       10. 성격 : 알수없음 
13                          11. 대인관계 : 대인관계문제_모름,친구 관련 
14                              12. 정서상태 : 우울한 기분 관찰됨  
15     13. 경찰 최종자살판단유무 및 내용 : 자살_가족관계문제_      목매달기     
16                      14. 코로나와의 관련성 : 없음_2020년 이전 사망 
17                 15. 코로나의 자살영향 및 주요인 : 없음_2020년 이전 사망
Name: EVENT_DTL, dtype: object
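If a plain Python list is preferred over an exploded Series, something close to the re.split attempt in the question can also work. The pattern in the question fails to compile because most of its (?<= groups are never closed, which is what the "missing )" message reports. The sketch below swaps in a single lookahead instead; the sample text is a made-up stand-in, and the pattern is an assumption about what counts as a marker.

```python
import re

# Hypothetical, shortened stand-in for one EVENT_DTL value.
text = "1. name : Kim 2. found : 2013.5.18 13:00 3. place : 1) home 10. note : none"

# (?<!\d)      : not preceded by a digit, so "10. " is not split between 1 and 0
# (?=\d+[.)] ) : looks ahead for "number, dot or paren, space" without consuming it
marker = re.compile(r"(?<!\d)(?=\d+[.)] )")

# Splitting on an empty match needs Python 3.7 or later.
# Empty pieces are dropped and trailing spaces are stripped.
parts = [p.rstrip() for p in marker.split(text) if p]

print("\n".join(parts))
```

Because the lookahead consumes nothing, each marker stays attached to the text that follows it.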