Regular expression to filter out URLs with a literal dot after the last slash

Question

I need a regex to identify URLs that, after the last forward slash,

have a literal dot, such as

```
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
```

do not have a literal dot, such as
```
http://www.example.es/cat1/cat2/cat3
```

So far I have only found regular expressions that match everything before the last forward slash, `^(.*[\\\/])`, or everything after it, `[^/]+$`, as well as one that matches everything after a literal dot following the last slash, `(?!.*\.)(.*)`. Yet I am unable to come up with the above. Please help.

If this is the requested URL, then the fragment identifier (ie. everything after the `#` eg `.Rh1-js_4`) is not actually passed to the server, so you can't check this server-side. (?) — MrWhite, Aug 13 '15 at 00:19

zmo · Answer 1 · 2015-08-13T00:54:10.237

Well, as usual, a regex is the wrong tool for matching a URL. You can use urlparse (or urllib.parse in Python 3) to do the job in a very Pythonic way:

```
>>> from urlparse import urlparse
>>> urlparse('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/some-example_DH148439', params='', query='', fragment='.Rh1-js_4')
>>> urlparse('http://www.example.es/cat1/cat2/cat3')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/cat3', params='', query='', fragment='')
```
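Building on the parsed result, a minimal Python 3 sketch of the check itself could look like the following. The helper name `has_dot_after_last_slash` is hypothetical, and the sketch assumes that "after the last slash" means the last path segment together with any query string and fragment.

```python
from urllib.parse import urlparse


def has_dot_after_last_slash(url):
    """Return True if a literal dot appears after the last '/' of the path.

    Hypothetical helper: it assumes the last path segment, the query and the
    fragment all count as "after the last slash".
    """
    parts = urlparse(url)
    # Everything after the final '/' of the path (the whole path if none).
    last_segment = parts.path.rsplit('/', 1)[-1]
    tail = last_segment + parts.query + parts.fragment
    return '.' in tail


print(has_dot_after_last_slash(
    'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4'))  # True
print(has_dot_after_last_slash('http://www.example.es/cat1/cat2/cat3'))  # False
```

Because the host name is parsed separately, the dots in `www.example.es` are not counted, so a bare `http://www.example.es` yields `False`.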

If you really want a regex, the following is an example that would answer your question:

```
>>> import re
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4') is not None
True
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/cat3') is not None
True
```

The regex I'm giving is good enough to answer your question, but it is not a good way to validate a URL or to split it into pieces. I'd say its only merit is that it actually answers your question.

Here's the automaton generated by the regex, to better understand it:

[Image: regular expression visualization]

Beware of what you're asking, because JL's regex won't match:

http://www.example.es/cat1/cat2/cat3

As far as I can tell after rereading your question three times, you're actually asking for the following regex:

\/([^/]*)$

which will match both your examples:

http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
http://www.example.es/cat1/cat2/cat3
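As a rough illustration, the pattern above can be used to capture whatever follows the last slash, after which a plain substring test decides whether a dot is present. This is a sketch in Python 3; the name `last_segment` is made up for the example.

```python
import re

# The pattern from the answer: capture everything after the last '/'.
LAST_SEGMENT = re.compile(r'/([^/]*)$')


def last_segment(url):
    """Return the text after the last '/', or None if there is no slash."""
    match = LAST_SEGMENT.search(url)
    return match.group(1) if match else None


for url in ('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4',
            'http://www.example.es/cat1/cat2/cat3'):
    segment = last_segment(url)
    has_dot = segment is not None and '.' in segment
    print(segment, has_dot)
# some-example_DH148439#.Rh1-js_4 True
# cat3 False
```

The regex itself matches both URLs; it is the separate `'.' in segment` test that tells them apart.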

What @jl-peyret suggests is only how to match a literal dot following a /, which generates the following automaton:

[Image: regular expression visualization]

So, whatever you really want:

- use urlparse whenever you can to match parts of a URL;
- if you're trying to define a Django route, then trying to match the fragment is hopeless;
- next time you ask a question, please make it precise and give an example of what you tried: help us help you.

Why wouldn't that work in django? Besides the fact that the fragment part is only being used on the client side, the OP's question is pretty clear: he wants to match URLs. I'm not trying to guess what the OP is trying to do, and @w3d already made a comment about the fact that the fragment part is going to be eaten. — zmo, Aug 13 '15 at 00:30
I mean that in django routing you *need* to use a regex. Otherwise a split on / followed by a . check would do. — JL Peyret, Aug 13 '15 at 00:31
indeed, though it's not what the OP is asking, whatever he's actually asking… — zmo, Aug 13 '15 at 00:54
Well, I don't disagree with the old joke that sometimes a programmer starting out with one problem to solve and using a regex on it now has 2 problems... BTW, I did test that my regex did **not** match his cat2/cat3, but I really couldn't figure out from the question how you could ask for a regex that both needed a '.' and also didn't care if there was a '.' so I assumed he forgot to mention cat2/cat3 was **not** supposed to match. Question: where did you get your state diagram from? I often can use help on regexes. — JL Peyret, Aug 13 '15 at 01:25
Some guy wrote http://debuggex.com which is great, have a try, you can run your regex through the diagram it's quite fun. Though, it is *NOT* a state diagram, it is an *automaton*, more precisely an NFA. To read more about that, I have an old answer about regex where I gave [a few useful resources](http://stackoverflow.com/a/23654544/1290438) you might want to read. ;-) — zmo, Aug 13 '15 at 01:31

score 1 · Answer 2 · answered Aug 13 '15 at 00:30

\/([^\/]*\.+[^\/]*)$

The first / forces the match to start at a slash. The $ forces the end of the string, and both negated character classes prevent any / in between, so the slash that matches is the last one. You can check it at https://regex101.com/
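A quick way to try this pattern against the two URLs from the question is a short Python 3 script such as the one below. The constant name `DOT_AFTER_LAST_SLASH` is only illustrative, and `re.search` is used because the pattern is not anchored at the start of the string.

```python
import re

# The pattern from this answer, with the slashes left unescaped
# (Python does not require escaping '/').
DOT_AFTER_LAST_SLASH = re.compile(r'/([^/]*\.+[^/]*)$')

urls = [
    'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4',
    'http://www.example.es/cat1/cat2/cat3',
]

for url in urls:
    match = DOT_AFTER_LAST_SLASH.search(url)
    print(match.group(1) if match else None)
# some-example_DH148439#.Rh1-js_4
# None
```

One caveat: a bare host such as `http://www.example.es` also matches, because the text after the `//` contains dots and no further slash.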

JL Peyret

score 1 · Answer 3 · answered Aug 13 '15 at 01:53

I would use a look-ahead like so

(?=.*\.)([^/]+$)

(?=             # Look-Ahead
  .             # Any character except line break
  *             # (zero or more)(greedy)
  \.            # "."
)               # End of Look-Ahead
(               # Capturing Group (1)
  [^/]          # Character not in [/] Character Class
  +             # (one or more)(greedy)
  $             # End of string/line
)               # End of Capturing Group (1)

or a negative look-ahead like so

(?!.*\.)([^/]+$)

for the opposite case
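The behaviour of both look-aheads can be checked with a small Python 3 script. It assumes the patterns are applied with `re.search`, and the third pattern, `WITHOUT_DOT_ANCHORED`, is not from the answer: it is an assumed variant that pins the match to the last slash.

```python
import re

WITH_DOT = re.compile(r'(?=.*\.)([^/]+$)')     # positive look-ahead
WITHOUT_DOT = re.compile(r'(?!.*\.)([^/]+$)')  # negative look-ahead
# Assumed variant (not from the answer): start exactly at the last slash.
WITHOUT_DOT_ANCHORED = re.compile(r'/(?![^/]*\.)([^/]+)$')

dotted = 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4'
plain = 'http://www.example.es/cat1/cat2/cat3'

print(WITH_DOT.search(dotted).group(1))       # some-example_DH148439#.Rh1-js_4
print(WITH_DOT.search(plain))                 # None
print(WITHOUT_DOT.search(plain).group(1))     # cat3
print(WITHOUT_DOT.search(dotted).group(1))    # Rh1-js_4
print(WITHOUT_DOT_ANCHORED.search(dotted))    # None
print(WITHOUT_DOT_ANCHORED.search(plain).group(1))  # cat3
```

With an unanchored search, the negative look-ahead still finds a match in the dotted URL: the search slides forward until it is past the last dot and returns the trailing `Rh1-js_4`. If the goal is to reject such URLs outright, the match has to start at the last slash, which is what the anchored variant sketches.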