I have a column in a dataset (Column2) that contains a mix of text and numbers. I need to extract the numbers that have 10 or more digits into a new column (Column3). Any idea how to do this?
Column1 | Column2 |
---|---|
A | ghjy 123456677777 rttt 123.987 rtdggd |
ABC | 90999888877 asrteg 12.98 tggff 12300004 |
B | thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333 |
DLT | 1.2897 thhhsskkkk 456633388899000022 |
XYZ | tteerr 12.34 |
Expected output:
Column3 |
---|
123456677777 |
90999888877 |
126666555533333 |
456633388899000022 |
 |
I tried several approaches (regex, lambda functions, apply, map), but each one treats the entire column as one string. I would rather not split the values, because the real dataset has many words and numbers in each cell.
CodePudding user response:
You could try:
df['Column3'] = df['Column2'].str.extract(r'(\d{10,})')
print(df)
Column1 Column2 Column3
0 A ghjy 123456677777 rttt 123.987 rtdggd 123456677777
1 ABC 90999888877 asrteg 12.98 tggff 12300004 90999888877
2 B thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333 126666555533333
3 DLT 1.2897 thhhsskkkk 456633388899000022 456633388899000022
4 XYZ tteerr 12.34 NaN
To allow for multiple matches per string, you could do:
df['Column3'] = df['Column2'].str.findall(r'(\d{10,})').apply(', '.join)
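For reference, here is a self-contained sketch of both variants, assuming the sample data from the question. The column names `first_match` and `all_matches` are made up for illustration so the two results can sit side by side.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Column1": ["A", "ABC", "B", "DLT", "XYZ"],
    "Column2": [
        "ghjy 123456677777 rttt 123.987 rtdggd",
        "90999888877 asrteg 12.98 tggff 12300004",
        "thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333",
        "1.2897 thhhsskkkk 456633388899000022",
        "tteerr 12.34",
    ],
})

# First run of 10 or more digits in each row; rows without a match get NaN.
# expand=False returns a Series instead of a one-column DataFrame.
df["first_match"] = df["Column2"].str.extract(r"(\d{10,})", expand=False)

# Every run of 10 or more digits in each row, joined with ", ";
# rows without a match get an empty string.
df["all_matches"] = df["Column2"].str.findall(r"\d{10,}").apply(", ".join)

print(df[["Column1", "first_match", "all_matches"]])
```

Note that the results are strings, not integers. Also, `\d{10,}` matches any run of 10 or more digits, including one embedded in a longer token such as `abc1234567890`; whether that is wanted depends on the real data.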
CodePudding user response:
Maybe this works:
- Take the value of Column2 for each row
- Split it on whitespace
- Loop over the resulting tokens
- Check whether each token is numeric and its length is greater than or equal to 10
- Keep the token if that check passes
- Assign the result to Column3
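The steps above could be sketched roughly as follows. The helper name `long_numbers` and its `min_len` parameter are hypothetical, and the sample rows are taken from the question.

```python
import pandas as pd

def long_numbers(text, min_len=10):
    """Return whitespace-separated tokens made only of digits 0-9 and at least min_len long."""
    # isdigit() alone also accepts some non-ASCII digit characters,
    # so isascii() is added to restrict the check to 0-9.
    return [
        tok for tok in str(text).split()
        if tok.isascii() and tok.isdigit() and len(tok) >= min_len
    ]

# Sample rows from the question.
df = pd.DataFrame({
    "Column2": [
        "ghjy 123456677777 rttt 123.987 rtdggd",
        "90999888877 asrteg 12.98 tggff 12300004",
        "thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333",
        "1.2897 thhhsskkkk 456633388899000022",
        "tteerr 12.34",
    ],
})

# Apply the helper row by row and join the surviving tokens;
# rows with no qualifying token end up as an empty string.
df["Column3"] = df["Column2"].apply(lambda s: ", ".join(long_numbers(s)))
print(df)
```

Unlike the regex approach, this only keeps tokens that consist entirely of digits, so a number glued to letters or punctuation (for example `id:1234567890`) would be skipped. It is also likely to be slower than the vectorized string methods on a large dataset.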