Given the following SSH urls:

git@github.com:james/example
git@github.com:007/example
git@github.com:22/james/example
git@github.com:22/007/example

How can I extract the following components?

{user}@{host}:{optional port}{path (user/repo)}

As you can see in the examples, one of the usernames is numeric and is NOT a port. I can't figure out how to work around that. Also, a port isn't always present in the URL.

My current regex is:

^(?P<user>[^@]+)@(?P<host>[^:\s]+)?:(?:(?P<port>\d{1,5})\/)?(?P<path>[^\\].*)$

Not sure what else to try.
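
To make the failure concrete, here is a minimal sketch, assuming Python's `re` module (the `(?P<name>...)` group syntax suggests Python, but that is an assumption). It runs the regex above against the four URLs and prints what each group captures; the names `CURRENT`, `urls` and `results` are illustrative only.

```python
import re

# The regex from the question, unchanged.
CURRENT = re.compile(
    r"^(?P<user>[^@]+)@(?P<host>[^:\s]+)?:(?:(?P<port>\d{1,5})\/)?(?P<path>[^\\].*)$"
)

urls = [
    "git@github.com:james/example",
    "git@github.com:007/example",
    "git@github.com:22/james/example",
    "git@github.com:22/007/example",
]

# Map each URL to the dict of named groups it produces.
results = {url: CURRENT.match(url).groupdict() for url in urls}
for url, groups in results.items():
    print(url, "->", groups)
```

Only the second URL comes out wrong: `007` is captured as the port and the path is left as `example`, because `\d{1,5}\/` accepts any leading run of digits followed by a slash.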

ThatGuy343
2 Answers

3

Lazy quantifiers to the rescue!

This seems to work well and satisfies the optional port:

^
(?P<user>.*?)@
(?P<host>.*?):
(?:(?P<port>.*?)/)?
(?P<path>.*?/.*?)
$

With the `/x` (extended/verbose) modifier enabled, whitespace and line breaks in the pattern are ignored, so they are not part of the regex. If you are not using `/x`, remove all line breaks.

https://regex101.com/r/wdE30O/5
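
As a minimal sketch of how this pattern might be used, assuming Python (where `/x` is spelled `re.VERBOSE`): the helper name `parse_ssh_url` and the constant `SSH_URL` are hypothetical, not part of the original answer.

```python
import re

# Same pattern as above; re.VERBOSE plays the role of the /x modifier,
# so the line breaks and indentation inside the pattern are ignored.
SSH_URL = re.compile(
    r"""
    ^
    (?P<user>.*?)@
    (?P<host>.*?):
    (?:(?P<port>.*?)/)?
    (?P<path>.*?/.*?)
    $
    """,
    re.VERBOSE,
)


def parse_ssh_url(url):
    """Return the named groups as a dict, or None if the URL does not match."""
    match = SSH_URL.match(url)
    return match.groupdict() if match else None


for url in (
    "git@github.com:james/example",
    "git@github.com:007/example",
    "git@github.com:22/james/example",
    "git@github.com:22/007/example",
):
    print(parse_ssh_url(url))
```

Because the path group requires a slash, the optional port group can only match when two slashes follow the colon; with a single slash the engine backtracks and leaves `port` as `None`. Note that the port group is not restricted to digits, so this is a parser rather than a validator.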


Thank you @Jan for the optimizations.

MonkeyZeus
  • @ThatGuy343 correct, `007` comes after the colon `:` – MonkeyZeus Aug 28 '19 at 19:07
  • @ThatGuy343 if `007` is not the port then which capture group should contain it? Should that entry be ignored altogether since it is "invalid"? My goal was to create a simple parser, not a validator – MonkeyZeus Aug 28 '19 at 19:08
  • 007 is a username, and part of the path. The path group looks like this: `username/repo` – ThatGuy343 Aug 28 '19 at 19:09
  • @ThatGuy343 Now I am even more confused. Please edit your question and mark the expected user, host, port, and path captures from each example you've listed. – MonkeyZeus Aug 28 '19 at 19:11
  • @ThatGuy343 so the colon is always guaranteed? – MonkeyZeus Aug 28 '19 at 19:21
  • @MonkeyZeus +1 I think you are close, but what about `git@github.com:22/james`? It will match the path as `22/james`, when it should just be `james`, I believe. – vs97 Aug 28 '19 at 19:35
  • @vs97 my intent was to build a dumb parser, not validator. OP's question is still hazy at best until they can expand upon it per my request. Also, your example doesn't seem to fit any of theirs – MonkeyZeus Aug 28 '19 at 19:41
  • Really great work man, thank you for this. The trailing slash is included in the match but this small tweaked version fixes that: `^(?P<user>.*?)@(?P<host>.*?):(?:(?:(?P<port>.*?)\/)?(?P<path>.*?\/.*?))$` – ThatGuy343 Aug 28 '19 at 20:02
  • @MonkeyZeus: +1 but a few optimizations: a) use verbose mode, b) the non-capturing group as you're using it now is superfluous and c) get rid of the forward slash for the port. All in all see https://regex101.com/r/wdE30O/5 – Jan Aug 29 '19 at 06:55
  • @Jan Hmm, never thought to make use of `/x` but it really makes a huge legibility difference. I think that trying to do it all in one line contributed to me having the slash in the port ¯\_(ツ)_/¯ – MonkeyZeus Aug 29 '19 at 11:54
  • @ThatGuy343 Check out the update. Jan made some great suggestions. – MonkeyZeus Aug 29 '19 at 11:56
  • @MonkeyZeus: You're welcome. Indeed, the verbose flag makes the expressions more readable and helps me a lot when debugging code. – Jan Aug 29 '19 at 12:08
2

If you're on Python, you could write your very own parser:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = """git@github.com:james/example
git@github.com:007/example
git@github.com:22/james/example
git@github.com:22/007/example"""

class GitVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        expr        = user at domain colon rest

        user        = word+
        domain      = ~"[^:]+"
        rest        = (port path) / path

        path        = word slash word
        port        = digits slash

        slash       = "/"
        colon       = ":"
        at          = "@"
        digits      = ~"\d+"
        word        = ~"\w+"

        """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_user(self, node, visited_children):
        return {"user": node.text}

    def visit_domain(self, node, visited_children):
        return {"domain": node.text}

    def visit_rest(self, node, visited_children):
        child = visited_children[0]
        if isinstance(child, list):
            # first branch, port and path
            return {"port": child[0], "path": child[1]}
        else:
            return {"path": child}

    def visit_path(self, node, visited_children):
        return node.text

    def visit_port(self, node, visited_children):
        digits, _ = visited_children
        return digits.text

    def visit_expr(self, node, visited_children):
        # Merge the dicts produced by user, domain and rest; skip literal nodes.
        out = {}
        for child in visited_children:
            if isinstance(child, dict):
                out.update(child)
        return out

gv = GitVisitor()
for line in data.split("\n"):
    result = gv.parse(line)
    print(result)

Which would yield

{'user': 'git', 'domain': 'github.com', 'path': 'james/example'}
{'user': 'git', 'domain': 'github.com', 'path': '007/example'}
{'user': 'git', 'domain': 'github.com', 'port': '22', 'path': 'james/example'}
{'user': 'git', 'domain': 'github.com', 'port': '22', 'path': '007/example'}

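A parser copes well with the kind of ambiguity you have here: the ordered choice in `rest` tries `port path` first and falls back to `path` alone when the digits are not followed by a full `user/repo` pair.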
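
The fallback the grammar performs (try `port path`, otherwise `path` alone) can also be sketched by hand with plain string methods. This is not the parsimonious approach above, just a dependency-free illustration of the same ordered choice; the function name `parse_ssh_url` is hypothetical, and it assumes the path is always exactly `user/repo`.

```python
def parse_ssh_url(url):
    """Split '{user}@{host}:[{port}/]{user}/{repo}' into a dict."""
    user, at, remainder = url.partition("@")
    host, colon, rest = remainder.partition(":")
    parts = rest.split("/")
    if not (user and at and host and colon and all(parts)):
        raise ValueError(f"not an SSH-style URL: {url!r}")

    result = {"user": user, "domain": host}
    if len(parts) == 3 and parts[0].isdigit():
        # First alternative: a numeric port followed by user/repo.
        result["port"] = parts[0]
        result["path"] = "/".join(parts[1:])
    elif len(parts) == 2:
        # Second alternative: user/repo only, even if the user is numeric.
        result["path"] = rest
    else:
        raise ValueError(f"unexpected path in {url!r}")
    return result


for line in (
    "git@github.com:james/example",
    "git@github.com:007/example",
    "git@github.com:22/james/example",
    "git@github.com:22/007/example",
):
    print(parse_ssh_url(line))
```

Counting the segments after the colon is what resolves the ambiguity: a leading number is only treated as a port when a complete `user/repo` pair still follows it.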

Jan
  • You know, I started participating in the regex tag a little over 2 months ago to improve my skills and I feel like I've made good progress; not sure if I am an expert but at least regex doesn't seem overwhelmingly cryptic anymore (I owe most of this to using regex visualizers). Never did I think that I would get into parsers but your post is so nonchalantly "meh, try this parser" that this might be my next adventure, lol – MonkeyZeus Aug 29 '19 at 12:15
  • @MonkeyZeus: Glad to have another traveler here. A good starting point is https://tomassetti.me/guide-parsing-algorithms-terminology/ for a general overview and https://github.com/erikrose/parsimonious for a PEG parser in `Python`. It really widens the possibilities. – Jan Aug 29 '19 at 13:12
  • @MonkeyZeus: Shameless self-advertisement: you could very well combine both regular expressions and a parser to have the best of both worlds, see https://stackoverflow.com/a/57749422/1231450 – Jan Sep 02 '19 at 06:20