I have a dataframe with a column df['EVENT_DTL']
whose values look like this:
1. 변사자 정보 : Kim_******-1****** 2. 발견일시 : 2013년05월18일 13:00 3. 발견장소 : 1) 수사기록 상 주소 주민등록상 주소 : 실거주지 주소 : 시도(발견)장소 주소 : 2) 실제 조사원이 입력한 주소주민등록상 주소 : 실거주지 주소 : 시도(발견)장소 주소 : 조치(사유 포함) : 4. 발견장소 코딩사유 : 자택 / 5. 방법/수단 : 목매달기6. 발견경위 : 2013.5.18 13:00경 New York, in his apartment 7. 주원인 코딩사유 : Family reason 8. 기본배경정보 : 원단도매업 / 자녀 및 손주 과 거주 결혼상태_별거 9. 사회경제적상태 : Strong depression 10. 성격 : 알수없음 11. 대인관계 : 대인관계문제_모름,친구 관련 12. 정서상태 : 우울한 기분 관찰됨 13. 경찰 최종자살판단유무 및 내용 : 자살_가족관계문제_ 목매달기 14. 코로나와의 관련성 : 없음_2020년 이전 사망 15. 코로나의 자살영향 및 주요인 : 없음_2020년 이전 사망
NOTE: The above is a single line, not several lines. I have only wrapped it here for display.
I want to split it at 1. 2. 3. … 15.
and insert "\n" before each number.
The desired output looks like this:
\n1. 변사자 정보 : Kim_******-1******
\n2. 발견일시 : 2013년05월18일 13:00
\n3. 발견장소 :
\n1) 수사기록 상 주소
주민등록상 주소 :
실거주지 주소 :
시도(발견)장소 주소 :
\n2) 실제 조사원이 입력한 주소
주민등록상 주소 :
실거주지 주소 :
시도(발견)장소 주소 :
조치(사유 포함) :
\n4. 발견장소 코딩사유 : 자택 /
\n5. 방법/수단 : 목매달기
\n6. 발견경위 : 2013.5.18 13:00경 New York, in his apartment
\n7. 주원인 코딩사유 : Family reason
\n8. 기본배경정보 : 원단도매업 / 자녀 및 손주 과 거주 결혼상태_별거
\n9. 사회경제적상태 : Strong depression
\n10. 성격 : 알수없음
\n11. 대인관계 : 대인관계문제_모름,친구 관련
\n12. 정서상태 : 우울한 기분 관찰됨
\n13. 경찰 최종자살판단유무 및 내용 : 자살_가족관계문제_ 목매달기
\n14. 코로나와의 관련성 : 없음_2020년 이전 사망
\n15. 코로나의 자살영향 및 주요인 : 없음_2020년 이전 사망
I tried this (note: some rows already start with \n):
import re
df3 = df.loc[~df.EVENT_DTL.str.contains('\n',na=False),'EVENT_DTL']
re.split('(?<=1.|(?<=2.||(?<=3.|(?<=1\)|(?<=2)|(?<=4.|(?<=5.|(?<=6.|(?<=7.|(?<=8.|(?<=9.|(?<=10.|(?<=11.|(?<=12.|(?<=13.|(?<=14.|(?<=15.',df3)
but it raises the following error (sorry for the long output):
error Traceback (most recent call last)
<ipython-input-20-3b8b06001e11> in <module>
2
3 df3 = df.loc[~df.EVENT_DTL.str.contains('\n',na=False),'EVENT_DTL']
----> 4 re.split('(?<=1.|(?<=2.||(?<=3.|(?<=1\)|(?<=2)|(?<=4.|(?<=5.|(?<=6.|(?<=7.|(?<=8.|(?<=9.|(?<=10.|(?<=11.|(?<=12.|(?<=13.|(?<=14.|(?<=15.',df3)
35 frames
/usr/lib/python3.7/re.py in split(pattern, string, maxsplit, flags)
213 and the remainder of the string is returned as the final element
214 of the list."""
--> 215 return _compile(pattern, flags).split(string, maxsplit)
216
217 def findall(pattern, string, flags=0):
/usr/lib/python3.7/re.py in _compile(pattern, flags)
286 if not sre_compile.isstring(pattern):
287 raise TypeError("first argument must be string or compiled pattern")
--> 288 p = sre_compile.compile(pattern, flags)
289 if not (flags & DEBUG):
290 if len(_cache) >= _MAXCACHE:
/usr/lib/python3.7/sre_compile.py in compile(p, flags)
762 if isstring(p):
763 pattern = p
--> 764 p = sre_parse.parse(p, flags)
765 else:
766 pattern = None
/usr/lib/python3.7/sre_parse.py in parse(str, flags, pattern)
922
923 try:
--> 924 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
925 except Verbose:
926 # the VERBOSE flag was switched on inside the pattern. to be
/usr/lib/python3.7/sre_parse.py in _parse_sub(source, state, verbose, nested)
418 while True:
419 itemsappend(_parse(source, state, verbose, nested + 1,
--> 420 not nested and not items))
421 if not sourcematch("|"):
422 break
/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
728 if lookbehindgroups is None:
729 state.lookbehindgroups = state.groups
--> 730 p = _parse_sub(source, state, verbose, nested + 1)
731 if dir < 0:
732 if lookbehindgroups is None:
[... the two frames above repeat 14 more times, followed by one more _parse_sub frame ...]
/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
734 if not sourcematch(")"):
735 raise source.error("missing ), unterminated subpattern",
--> 736 source.tell() - start)
737 if char == "=":
738 subpatternappend((ASSERT, (dir, p)))
error: missing ), unterminated subpattern at position 119
CodePudding user response:
df['EVENT_DTL'] = "\n" + df['EVENT_DTL'].astype(str)
CodePudding user response:
I think you are looking for every place where a line starts with a digit, and you want to insert a special string there, before the digit.
It isn't clear to me whether you want to add an actual newline character or a backslash followed by an n.
This will add a newline:
result = re.sub(r"^(\d)", r"\n\1", df3, flags=re.MULTILINE)
print(result)
This will add "\n" as a two-character string:
result = re.sub(r"^(\d)", r"\\n\1", df3, flags=re.MULTILINE)
print(result)
This works by searching for the start of a line (indicated by ^ together with re.MULTILINE) followed by any digit (\d), and then substituting it with "\n" followed by the originally matched digit (\1, the first matched "group"). Note that re.sub expects a single string, so df3 here has to be one string rather than a pandas Series.
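To make the two replacement templates and the behaviour of ^ concrete, here is a minimal sketch using only the standard library. The strings `text` and `one_line` are made-up stand-ins, not values from the question's dataframe.

```python
import re

# Hypothetical two-line input: each line starts with a digit.
text = "1. first\n2. second"

# With re.MULTILINE, ^ matches at the start of every line.
# In the replacement template, \n is turned into a real newline character.
real_newline = re.sub(r"^(\d)", r"\n\1", text, flags=re.MULTILINE)

# \\n in the template stays as a backslash followed by the letter n.
literal_backslash_n = re.sub(r"^(\d)", r"\\n\1", text, flags=re.MULTILINE)

# Hypothetical single-line input, like the rows described in the question:
# ^ only matches once, at the very start, so "2." is left untouched.
one_line = "1. first 2. second"
only_first = re.sub(r"^(\d)", r"\n\1", one_line, flags=re.MULTILINE)

print(repr(real_newline))
print(repr(literal_backslash_n))
print(repr(only_first))
```

The last case suggests that anchoring on ^ is probably not enough when every numbered item sits on the same line, as in the question's sample.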
CodePudding user response:
You can use replace in pandas with regex=True:
df['EVENT_DTL'].replace(r"(\d+[\.|\)] )", r"\n\1", regex=True)
The regex matches any subsequence that starts with a number (\d+), followed by either a . or a ) ([\.|\)]) and then a space. It replaces this subsequence with "\n" followed by the subsequence itself (see capture groups).
A more detailed explanation of the regex can be found here: https://regex101.com/r/2peTg4/1
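The same substitution can be tried on a plain string with the standard library, which makes it easy to check edge cases before touching the dataframe. This is only a sketch: `sample` is a shortened, made-up stand-in for one EVENT_DTL value, `add_newlines` is a hypothetical helper name, and the character class is written as `[.)]`, which matches the same markers as `[\.|\)]` except that the latter also accepts a literal `|`.

```python
import re

# Shortened, hypothetical stand-in for one EVENT_DTL value.
# Note "home10." has no space before the number, like "목매달기6." in the question.
sample = "1. name : Kim 2. found : 2013.5.18 13:00 3. place : 1) address A 2) address B 4. reason : home10. personality : unknown"

# One or more digits, then "." or ")", then a space.
MARKER = re.compile(r"(\d+[.)] )")

def add_newlines(s):
    """Insert a newline character before every numbered marker in s."""
    return MARKER.sub(r"\n\1", s)

result = add_newlines(sample)
print(result)
```

The date "2013.5.18" is left alone because none of its dots is followed by a space. Dates or decimals that are followed by a space (for example "2013. 5. 18") would be split, so real data may need a stricter pattern.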
Result of applying the regex and splitting on "\n", i.e.:
df['EVENT_DTL'].replace(r"(\d+[\.|\)] )", r"\n\1", regex=True).str.split("\n").explode()
1 1. 변사자 정보 : Kim_******-1******
2 2. 발견일시 : 2013년05월18일 13:00
3 3. 발견장소 :
4 1) 수사기록 상 주소 주민등록상 주소 : 실거주지 주소 : 시도(발견)장소 주소 :
5 2) 실제 조사원이 입력한 주소주민등록상 주소 : 실거주지 주소 : 시도(발견)장소...
6 4. 발견장소 코딩사유 : 자택 /
7 5. 방법/수단 : 목매달기
8 6. 발견경위 : 2013.5.18 13:00경 New York, in his ap...
9 7. 주원인 코딩사유 : Family reason
10 8. 기본배경정보 : 원단도매업 / 자녀 및 손주 과 거주 결혼상태_별거
11 9. 사회경제적상태 : Strong depression
12 10. 성격 : 알수없음
13 11. 대인관계 : 대인관계문제_모름,친구 관련
14 12. 정서상태 : 우울한 기분 관찰됨
15 13. 경찰 최종자살판단유무 및 내용 : 자살_가족관계문제_ 목매달기
16 14. 코로나와의 관련성 : 없음_2020년 이전 사망
17 15. 코로나의 자살영향 및 주요인 : 없음_2020년 이전 사망
Name: EVENT_DTL, dtype: object
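For completeness, the split the question originally attempted can also be done directly with re.split. The error in the question ("missing ), unterminated subpattern") comes from the `(?<=` groups that are never closed; a single lookahead avoids that. The sketch below assumes Python 3.7 or later, where re.split can split on empty matches, and uses a made-up `sample` string.

```python
import re

# Hypothetical single-line value with "." and ")" markers and a two-digit number.
sample = "1. a 2. b 1) c 10. d"

# (?=\d+[.)] ) : split just before a numbered marker, without consuming it.
# (?<!\d)     : do not split between the digits of a multi-digit number like "10".
SPLIT_AT = re.compile(r"(?<!\d)(?=\d+[.)] )")

# The split yields an empty first piece (the marker at position 0), so drop blanks.
parts = [p.strip() for p in SPLIT_AT.split(sample) if p.strip()]
print(parts)
```

Joining `parts` with "\n" gives one marker per line, which should be close to the desired output in the question, apart from the unnumbered sub-lines under 3.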