I am trying to find all the URLs in a string. There are many similar questions on this site, but none matches exactly what I want. My text may contain URLs as well as other text with dots. An example is shown below:
This is a text. It may contain links such as https://stackoverflow.com/, www.test.com but it may also contain other things such as exampleimage.this.png or picture.man.jpeg which are not URLs. On the other hand it may contain URLs without protocol such as example.com
So in the text above, I would only like to get the URLs, namely:
- https://stackoverflow.com/
- www.test.com
- example.com
But I should not get exampleimage.this.png or picture.man.jpeg as a URL.
I have tried
- "(?:(?:https?|ftp)://)?[\w/-?=%.]+.[\w/-&?=%.]+" which gives me all the URLs except example.com.
And I have also tried
- "(ftp://|www.|https?://){1}[a-zA-Z0-9u00a1-\uffff0-]{2,}.[a-zA-Z0-9u00a1-\uffff0-]{2,}(\S*)"
- "(?:(?:https?|ftp)://|\b(?:[a-z\d] .))(?:(?:[^\s()<>] |((?:[^\s()<>] |(?:([^\s()<>] )))?)) (?:((?:[^\s()<>] |(?:(?:[^\s()<>] )))?)|[^\s`!()[]{};:'"".,<>?«»“”‘’])"
which give me all the URLs, but also exampleimage.this.png and picture.man.jpeg, which is not what I want.
Could anybody please help me out? Anything other than regex would also be fine. I am using C# with Regex for this.
CodePudding user response:
You need to match up to a valid TLD:
var pattern = @"(?i)\b(?:(?:http|ftp)s?://|www\.)?[\w/?=%.-]*?\.(?:A(?:A(?:A|RP)|B(?:ARTH|B(?:OTT|VIE)?|C|LE|OGADO|UDHABI)|C(?:ADEMY|C(?:ENTURE|OUNTANTS?)|O|TOR)?|D(?:AC|S|ULT)?|E(?:G|RO|TNA)?|F(?:L|RICA)?|G(?:AKHAN|ENCY)?|I(?:G|R(?:BUS|FORCE|TEL))?|KDN|L(?:FAROMEO|I(?:BABA|PAY)|L(?:FINANZ|STATE|Y)|S(?:ACE|TOM))?|M(?:AZON|E(?:RICAN(?:EXPRESS|FAMILY)|X)|FAM|ICA|STERDAM)?|N(?:ALYTICS|DROID|QUAN|Z)|OL?|P(?:ARTMENTS|P(?:LE)?)|Q(?:UARELLE)?|R(?:A(?:B|MCO)|CHI|MY|PA|TE?)?|S(?:DA|IA|SOCIATES)?|T(?:HLETA|TORNEY)?|U(?:CTION|DI(?:BLE|O)?|SPOST|T(?:HOR|OS?))?|VIANCA|WS?|XA?|Z(?:URE)?)|B(?:A(?:BY|IDU|N(?:A(?:MEX|NAREPUBLIC)|[DK])|R(?:C(?:ELONA|LAY(?:CARD|S))|EFOOT|GAINS)?|S(?:E|KET)BALL|UHAUS|YERN)?|B(?:[CT]|VA)?|C[GN]|D|E(?:A(?:TS|UTY)|ER|NTLEY|RLIN|ST(?:BUY)?|T)?|[FG]|H(?:ARTI)?|I(?:BLE|D|KE|NGO?|[OZ])?|J|L(?:ACK(?:FRIDAY)?|O(?:CKBUSTER|G|OMBERG)|UE)|M[SW]?|N(?:PPARIBAS)?|O(?:ATS|EHRINGER|FA|M|ND|O(?:K(?:ING)?)?|S(?:CH|T(?:IK|ON))|T|UTIQUE|X)?|R(?:ADESCO|IDGESTONE|O(?:ADWAY|KER|THER)|USSELS)?|[ST]|U(?:DAPEST|GATTI|ILD(?:ERS)?|SINESS|Y|ZZ)|[VWY]|ZH?)|C(?:A(?:B|FE|L(?:L|VINKLEIN)?|M(?:ERA|P)?|N(?:CERRESEARCH|ON)|P(?:ETOWN|ITAL(?:ONE)?)|R(?:AVAN|DS|E(?:ERS?)?|S)?|S(?:[AEH]|INO)|T(?:ERING|HOLIC)?)?|B(?:[AN]|RE|S)|[CD]|E(?:NTER|O|RN)|F[AD]?|G|H(?:A(?:N(?:E|NE)L|RITY|SE|T)|EAP|INTAI|R(?:ISTMAS|OME)|URCH)?|I(?:PRIANI|RCLE|SCO|T(?:ADEL|IC?|Y(?:EATS)?))?|K|L(?:AIMS|EANING|I(?:CK|NI(?:C|QUE))|O(?:THING|UD)|UB(?:MED)?)?|[MN]|O(?:ACH|DES|FFEE|L(?:LEG|OGN)E|M(?:CAST|M(?:BANK|UNITY)|P(?:A(?:NY|RE)|UTER)|SEC)?|N(?:DOS|S(?:TRUCTION|ULTING)|T(?:ACT|RACTORS))|O(?:KING(?:CHANNEL)?|[LP])|RSICA|U(?:NTRY|PONS?|RSES))?|PA|R(?:EDIT(?:CARD|UNION)?|ICKET|OWN|S|UISES?)?|SC|U(?:ISINELLA)?|[V-X]|Y(?:MRU|OU)?|Z)|D(?:A(?:BUR|D|NCE|T(?:[AE]|ING|SUN)|Y)|CLK|DS|E(?:AL(?:ER|S)?|GREE|L(?:IVERY|L|OITTE|TA)|MOCRAT|NT(?:AL|IST)|SI(?:GN)?|V)?|HL|I(?:AMONDS|ET|GITAL|RECT(?:ORY)?|S(?:CO(?:UNT|VER)|H)|Y)|[JKM]|NP|O(?:C(?:S|TOR)|G|MAINS|T|WNLOAD)?|RIVE|TV|U(?:BAI|NLOP|PONT|RBAN)|V(?:AG|R)|Z)|E(?:A(?:RTH|T)|CO?|D(?:EKA|
U(?:CATION)?)|[EG]|M(?:AIL|ERCK)|N(?:ERGY|GINEER(?:ING)?|TERPRISES)|PSON|QUIPMENT|R(?:ICSSON|NI)?|S(?:Q|TATE)?|T(?:ISALAT)?|U(?:ROVISION|S)?|VENTS|X(?:CHANGE|P(?:ERT|OSED|RESS)|TRASPACE))|F(?:A(?:GE|I(?:L|RWINDS|TH)|MILY|NS?|RM(?:ERS)?|S(?:HION|T))|E(?:DEX|EDBACK|RR(?:ARI|ERO))|I(?:AT|D(?:ELITY|O)|LM|NA(?:L|NC(?:E|IAL))|R(?:E(?:STONE)?|MDALE)|SH(?:ING)?|T(?:NESS)?)?|[JK]|L(?:I(?:CKR|GHTS|R)|O(?:RIST|WERS)|Y)|M|O(?:O(?:D(?:NETWORK)?|TBALL)?|R(?:D|EX|SALE|UM)|UNDATION|X)?|R(?:E(?:E|SENIUS)|L|O(?:GANS|NT(?:DOO|IE)R))?|TR|U(?:JITSU|ND?|RNITURE|TBOL)|YI)|G(?:A(?:L(?:L(?:ERY|O|UP))?|MES?|P|RDEN|Y)?|B(?:IZ)?|DN?|E(?:A|NT(?:ING)?|ORGE)?|F|G(?:EE)?|H|I(?:FTS?|V(?:ES|ING))?|L(?:ASS|E|OB(?:AL|O))?|M(?:AIL|BH|[OX])?|N|O(?:DADDY|L(?:D(?:POINT)?|F)|O(?:DYEAR|G(?:LE)?)?|[PTV])|[PQ]|R(?:A(?:INGER|PHICS|TIS)|EEN|IPE|O(?:CERY|UP))?|[ST]|U(?:ARDIAN|CCI|GE|I(?:DE|TARS)|RU)?|[WY])|H(?:A(?:IR|MBURG|NGOUT|US)|BO|DFC(?:BANK)?|E(?:ALTH(?:CARE)?|L(?:P|SINKI)|R(?:E|MES))|GTV|I(?:PHOP|SAMITSU|TACHI|V)|KT?|[MN]|O(?:CKEY|L(?:DINGS|IDAY)|ME(?:DEPOT|GOODS|S(?:ENSE)?)|NDA|RSE|S(?:PITAL|T(?:ING)?)|T(?:EL(?:E)?S|MAIL)?|USE|W)|R|SBC|T|U(?:GHES)?|Y(?:ATT|UNDAI))|I(?:BM|C(?:BC|[EU])|D|E(?:EE)?|FM|KANO|L|M(?:AMAT|DB|MO(?:BILIEN)?)?|N(?:C|DUSTRIES|F(?:INITI|O)|[GK]|S(?:TITUTE|UR(?:ANC)?E)|T(?:ERNATIONAL|UIT)?|VESTMENTS)?|O|PIRANGA|Q|R(?:ISH)?|S(?:MAILI|T(?:ANBUL)?)?|T(?:AU|V)?)|J(?:A(?:GUAR|VA)|CB|E(?:EP|TZT|WELRY)?|IO|LL|MP?|NJ|O(?:B(?:S|URG)|[TY])?|P(?:MORGAN|RS)?|U(?:EGOS|NIPER))|K(?:AUFEN|DDI|E(?:RRY(?:HOTEL|LOGISTIC|PROPERTIE)S)?|FH|[GH]|I(?:[AM]|ND(?:ER|LE)|TCHEN|WI)?|[MN]|O(?:ELN|MATSU|SHER)|P(?:MG|N)?|R(?:D|ED)?|UOKGROUP|W|Y(?:OTO)?|Z)|L(?:A(?:CAIXA|M(?:BORGHINI|ER)|N(?:C(?:ASTER|IA)|D(?:ROVER)?|XESS)|SALLE|T(?:INO|ROBE)?|W(?:YER)?)?|[BC]|DS|E(?:ASE|CLERC|FRAK|G(?:AL|O)|XUS)|GBT|I(?:DL|FE(?:INSURANCE|STYLE)?|GHTING|KE|LLY|M(?:ITED|O)|N(?:COLN|DE|K)|PSY|V(?:E|ING))?|K|L[CP]|O(?:ANS?|C(?:KER|US)|FT|L|NDON|TT[EO]|VE)|PL(?:FINANCIAL)?|[RS]|T(?:DA?)?|U(?:NDBECK|X(?:E|URY))?|[VY])|M(?:A(?:CYS|DRID|I(?:F
|SON)|KEUP|N(?:AGEMENT|GO)?|P|R(?:KET(?:ING|S)?|RIOTT|SHALLS)|SERATI|TTEL)?|BA|C(?:KINSEY)?|D|E(?:D(?:IA)?|ET|LBOURNE|M(?:E|ORIAL)|NU?|RCKMSD)?|[GH]|I(?:AMI|CROSOFT|L|N[IT]|T(?:SUBISHI)?)|K|L[BS]?|MA?|N|O(?:BI(?:LE)?|DA|[EIM]|N(?:ASH|EY|STER)|R(?:MON|TGAGE)|SCOW|TO(?:RCYCLES)?|V(?:IE)?)?|[P-R]|SD?|T[NR]?|U(?:S(?:EUM|IC)|TUAL)?|[V-Z])|N(?:A(?:B|GOYA|ME|TURA|VY)?|BA|C|E(?:C|T(?:BANK|FLIX|WORK)?|USTAR|WS?|X(?:T(?:DIRECT)?|US))?|FL?|GO?|HK|I(?:CO|K(?:E|ON)|NJA|SSA[NY])?|L|O(?:KIA|RT(?:HWESTERNMUTUAL|ON)|W(?:RUZ|TV)?)?|P|R[AW]?|TT|U|YC|Z)|O(?:B(?:I|SERVER)|FFICE|KINAWA|L(?:AYAN(?:GROUP)?|DNAVY|LO)|M(?:EGA)?|N(?:[EG]|L(?:INE)?)|OO|PEN|R(?:A(?:CL|NG)E|G(?:ANIC)?|IGINS)|SAKA|T(?:SUKA|T)|VH)|P(?:A(?:GE|NASONIC|R(?:IS|S|T(?:NERS|[SY]))|SSAGENS|Y)?|CCW|ET?|F(?:IZER)?|G|H(?:ARMACY|D|ILIPS|O(?:NE|TO(?:GRAPHY|S)?)|YSIO)?|I(?:C(?:S|T(?:ET|URES))|D|N[GK]?|ONEER|ZZA)|K|L(?:A(?:CE|Y(?:STATION)?)|U(?:MBING|S))?|M|NC?|O(?:HL|KER|LITIE|RN|ST)|R(?:A(?:MERICA|XI)|ESS|IME|O(?:D(?:UCTIONS)?|F|GRESSIVE|MO|PERT(?:IES|Y)|TECTION)?|U(?:DENTIAL)?)?|[ST]|UB|WC?|Y)|Q(?:A|PON|UE(?:BEC|ST))|R(?:A(?:CING|DIO)|E(?:A(?:D|L(?:ESTATE|T(?:OR|Y)))|CIPES|D(?:STONE|UMBRELLA)?|HAB|I(?:SEN?|T)|LIANCE|N(?:T(?:ALS)?)?|P(?:AIR|ORT|UBLICAN)|ST(?:AURANT)?|VIEWS?|XROTH)?|I(?:C(?:H(?:ARDLI)?|OH)|[LOP])|O(?:C(?:HER|KS)|DEO|GERS|OM)?|S(?:VP)?|U(?:GBY|HR|N)?|WE?|YUKYU)|S(?:A(?:ARLAND|FE(?:TY)?|KURA|L(?:E|ON)|MS(?:CLUB|UNG)|N(?:DVIK(?:COROMANT)?|OFI)|P|RL|S|VE|XO)?|B[IS]?|C(?:[AB]|H(?:AEFFLER|MIDT|O(?:LARSHIPS|OL)|ULE|WARZ)|IENCE|OT)?|D|E(?:A(?:RCH|T)|CUR(?:E|ITY)|EK|LECT|NER|RVICES|S|VEN|W|XY?)?|FR|G|H(?:A(?:NGRILA|RP|W)|ELL|I(?:KSH)?A|O(?:ES|P(?:PING)?|UJI|W(?:TIME)?))?|I(?:LK|N(?:A|GLES)|TE)?|J|K(?:IN?|Y(?:PE)?)?|L(?:ING)?|M(?:ART|ILE)?|N(?:CF)?|O(?:C(?:CER|IAL)|FT(?:BANK|WARE)|HU|L(?:AR|UTIONS)|N[GY]|Y)?|P(?:A(?:CE)?|O(?:R)?T)|RL?|S|T(?:A(?:DA|PLES|R|TE(?:BANK|FARM))|C(?:GROUP)?|O(?:CKHOLM|R(?:AG)?E)|REAM|UD(?:IO|Y)|YLE)?|U(?:CKS|PP(?:L(?:IES|Y)|ORT)|R(?:F|GERY)|ZUKI)?|V|W(?:ATCH|ISS)|X|Y(?:DNEY|STEMS)?|Z)|T(?:A(?:B|
IPEI|LK|OBAO|RGET|T(?:A(?:MOTORS|R)|TOO)|XI?)|CI?|DK?|E(?:AM|CH(?:NOLOGY)?|L|MASEK|NNIS|VA)|[FG]|H(?:D|EAT(?:ER|RE))?|I(?:AA|CKETS|ENDA|FFANY|PS|R(?:ES|OL))|J(?:MAXX|X)?|K(?:MAXX)?|L|M(?:ALL)?|N|O(?:DAY|KYO|OLS|P|RAY|SHIBA|TAL|URS|WN|Y(?:OTA|S))?|R(?:A(?:D(?:E|ING)|INING|VEL(?:CHANNEL|ERS(?:INSURANCE)?)?)|UST|V)?|T|U(?:BE|I|NES|SHU)|VS?|[WZ])|U(?:A|B(?:ANK|S)|[GK]|N(?:I(?:COM|VERSITY)|O)|OL|PS|[SYZ])|V(?:A(?:CATIONS|N(?:A|GUARD))?|C|E(?:GAS|NTURES|R(?:ISIGN|SICHERUNG)|T)?|G|I(?:AJES|DEO|G|KING|LLAS|[NP]|RGIN|S(?:A|ION)|V[AO])?|LAANDEREN|N|O(?:DKA|L(?:KSWAGEN|VO)|T(?:E|ING|O)|YAGE)|U(?:ELOS)?)|W(?:A(?:L(?:ES|MART|TER)|NG(?:GOU)?|TCH(?:ES)?)|E(?:ATHER(?:CHANNEL)?|B(?:CAM|ER|SITE)|D(?:DING)?|I(?:BO|R))|F|HOSWHO|I(?:EN|KI|LLIAMHILL|N(?:DOWS|E|NERS)?)|ME|O(?:LTERSKLUWER|ODSIDE|R(?:KS?|LD)|W)|S|T[CF])|X(?:BOX|EROX|FINITY|I(?:HUA)?N|N--(?:1(?:1B4C3D|CK2E1B|QQW23A)|2SCRJ9C|3(?:0RR7Y|BST00M|DS443G|E0B707E|HCRJ9C|PXU8K)|4(?:2C2D9A|5(?:BR(?:5CYL|J9C)|Q11C)|DBRK0CE|GBRIM)|5(?:4B7FTA0CC|5Q(?:W42G|X5D)|SU34J936BGSG|TZM5G)|6(?:FRZ82G|QQ986B3XL)|8(?:0A(?:DXHKS|O21A|QECDR1A|S(?:EHDB|WG))|Y0A063A)|9(?:0A(?:3AC|E|IS)|DBQ2A|ET52U|KRT00A)|B(?:4W605FERD|CK1B9A5DRE4C)|C(?:1AVG|2BR7G|CK(?:2B3B|WCXETD)|G4BKI|LCHC0EA0B2G2A9GCD|ZR(?:694B|S0T|U2D))|D1A(?:CJ3B|LF)|E(?:1A4C|CKVDTC9D|FVY88H)|F(?:CT429K|HBEI|IQ(?:228C5HS|64B|S8S|Z9S)|JQ720A|LW351E|PCRJ9C3D|Z(?:C2C9E2C|YS8D69UVGM))|G(?:2XX48C|CKR3F0F|ECRJ9C|K3AT1E)|H(?:2BR(?:EG3EVE|J9C(?:8C)?)|XT814E)|I(?:1B6B1A6A2E|MR513N|O0A7I)|J(?:1A(?:EF|MH)|6W193G|LQ(?:480N2RG|61U9W7B)|VR189M)|K(?:CRX77D1X4A|P(?:R(?:W13|Y57)D|UT3I))|L(?:1ACC|GBBAT1AD8J)|M(?:GB(?:9AWBF|A(?:3A(?:3EJT|4F16A)|7C0BBN0A|A(?:KC7DVF|M7A8H)|B2BD|H1A3HJKRD|I9AZGQP6J|YH7GPA)|BH1A(?:71E)?|C(?:0A9AZCG|A7DZDO|PQ6GPA1A)|ERP4A5D4AR|GU82A|I4ECEXP|PL2FH|T(?:3DHD|X2B)|X4CD0AB)|IX891F|K1BU44C|XTQ1M)|N(?:GB(?:C5AZD|E9E0A|RX)|ODE|QV7F(?:S00EMA)?|YQY26A)|O(?:3CW4H|GBPF8FL|TU796D)|P(?:1A(?:CF|I)|GBS0DH|SSY2U)|Q(?:7CE6A|9JYB4C|CKA1PMC|XA(?:6A|M))|R(?:HQV96G|OVU88B|VC1E0AM3E)|S(?:9BRJ9C|ES554G)|T(?:60B
56A|CKWE|IQ49XQYJ)|UNUP4Y|V(?:ERMGENSBERAT(?:ER-CT|UNG-PW)B|HQUV|UQ861B)|W(?:4R(?:85EL8FHU5DNRA|S40L)|GB(?:H1C|L6A))|X(?:HQ521B|KC2(?:AL3HYE2A|DL3A5EE0H))|Y(?:9A3AQ|FRO4I67O|GBI2AMMX)|ZFR164B)|XX|YZ)|Y(?:A(?:CHTS|HOO|MAXUN|NDEX)|E|O(?:DOBASHI|GA|KOHAMA|U(?:TUBE)?)|T|UN)|Z(?:A(?:PPOS|RA)?|ERO|IP|M|ONE|UERICH|W))\b(?!\.\w)/?";
var output = Regex.Matches(text, pattern).Cast<Match>().Select(x => x.Value);
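To get a feel for how the pattern behaves without the full TLD list, here is a minimal sketch that swaps the TLD alternation for a tiny, hypothetical subset (com, org, net, man). The names UrlFinderDemo, TldSubset and FindUrls are made up for this example and are not part of the answer above. The man entry is included on purpose: the full pattern also accepts it as a TLD, so picture.man.jpeg is only rejected because of the (?!\.\w) lookahead.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class UrlFinderDemo
{
    // Assumption: a heavily shortened TLD list, for illustration only.
    // The pattern in the answer covers every TLD instead.
    private const string TldSubset = "com|org|net|man";

    private static readonly Regex UrlRegex = new Regex(
        @"(?i)\b(?:(?:http|ftp)s?://|www\.)?[\w/?=%.-]*?\.(?:" + TldSubset + @")\b(?!\.\w)/?");

    // Returns every substring of the input that the pattern accepts as a URL.
    public static string[] FindUrls(string text) =>
        UrlRegex.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        var text = "This is a text. It may contain links such as https://stackoverflow.com/, " +
                   "www.test.com but it may also contain other things such as " +
                   "exampleimage.this.png or picture.man.jpeg which are not URLs. " +
                   "On the other hand it may contain URLs without protocol such as example.com";

        foreach (var url in FindUrls(text))
            Console.WriteLine(url);
        // Prints:
        // https://stackoverflow.com/
        // www.test.com
        // example.com
    }
}
```

With such a short list, a URL ending in any other TLD is silently skipped, so this is only useful for experimenting with the structure of the pattern.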
See the regex demo. Details:
- (?i) - case insensitive matching ON
- \b - a word boundary
- (?:(?:http|ftp)s?://|www\.)? - an optional sequence of http or ftp followed with an optional s and then ://, or www.
- [\w/?=%.-]*? - zero or more word, /, ?, =, %, . or - chars, as few as possible
- \. - a . char
- (?:<TLD_PATTERN>) - a pattern that matches any TLD (listed in the IANA's TLD DB)
- \b - a word boundary
- (?!\.\w) - fail the match if there is a . and a word char immediately to the right of the current location
- /? - an optional / char.
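The <TLD_PATTERN> part does not have to be written by hand. As a rough sketch, it could be assembled at startup from a plain list of TLDs, which makes it easier to refresh when the list changes. Everything named here (TldPatternBuilder, BuildUrlRegex, the sample TLD array) is hypothetical, and the sketch assumes the input has one TLD per line with comment lines starting with #. It produces a flat alternation rather than the compact prefix-grouped one in the answer, so it may match more slowly on large inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TldPatternBuilder
{
    // Builds the URL regex around an alternation generated from the given TLDs.
    public static Regex BuildUrlRegex(IEnumerable<string> tlds)
    {
        var alternation = string.Join("|",
            tlds.Select(t => t.Trim())
                // Skip blank lines and comment lines (assumed to start with '#').
                .Where(t => t.Length > 0 && !t.StartsWith("#", StringComparison.Ordinal))
                // Try longer TLDs first so fewer alternatives need backtracking.
                .OrderByDescending(t => t.Length)
                .Select(t => Regex.Escape(t)));

        return new Regex(
            @"\b(?:(?:http|ftp)s?://|www\.)?[\w/?=%.-]*?\.(?:" + alternation + @")\b(?!\.\w)/?",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Assumption: a tiny sample; in practice the full TLD list would be loaded here.
        var tlds = new[] { "COM", "ORG", "NET", "MAN", "XN--P1AI" };
        var regex = BuildUrlRegex(tlds);

        foreach (Match m in regex.Matches("Visit example.com or docs.example.org/, not picture.man.jpeg"))
            Console.WriteLine(m.Value);
        // Prints:
        // example.com
        // docs.example.org/
    }
}
```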