
I've been stuck on this for about 2 days now. Unfortunately I may only use PowerShell (which I'm not good at). I want to match the following criteria using regex:

hxxp://www[.]website[.]org

google.com

www.google[.]com

foob://geller.xyz

hxxps://website[.]net/tree/branch/etc

I'm looking at URLs & domains (for IOCs) that are fanged and defanged. The URLs/domains come in all different formats, except they always include anycharacter.anycharacter. I thought the best way to match would be: if the string has a period with characters on both sides, then match from the beginning to the end of the string. The closest I have come is:

^.*\b[^.]+$\b

However, I'm not getting positive results with anything I've tried, and I would appreciate any ideas. To show that I'm not lazy, here's what I've got for the other IOCs (I'm just stuck on this one):

#Select a file with a dialog. TXT only

Add-Type -AssemblyName System.Windows.Forms
$FileBrowser = New-Object System.Windows.Forms.OpenFileDialog -Property @{
    InitialDirectory = [Environment]::GetFolderPath('Desktop')
    Filter = 'TXT (*.txt)|*.txt'
}
[void]$FileBrowser.ShowDialog()
$FileBrowser.FileNames

#Sets the file & applies each search string while creating the first output file

#First regex matches IPV4 <-- works well!
$input_path = $FileBrowser.FileNames
$output_file = 'C:\Users\output.csv'
$regex = '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file


#Second regex2 matches domains <-- is a problem
$regex2 = '\b^.*[^.]+$\b'
select-string $input_path -Pattern $regex2 -AllMatches | % { $_.Matches } | % { $_.Value } | Out-File -FilePath C:\Users\01100\Desktop\Folder\output.csv -Append

#Third matches any file extension <--- works well!
$regex3 = '^\.[a-zA-Z0-9]+$'
select-string $input_path -Pattern $regex3 -AllMatches | % { $_.Matches } | % { $_.Value } | Out-File -FilePath C:\Users\01100\Desktop\Folder\output.csv -Append

#Fourth matches any hash  <--- works well!
$regex4 = '[A-Fa-f0-9]{15,}'
select-string $input_path -Pattern $regex4 -AllMatches | % { $_.Matches } | % { $_.Value } | Out-File -FilePath C:\Users\01100\Desktop\Folder\output.csv -Append

#Fifth matches defanged IPs  <---works well!
$regex5 = '\b\d{1,3}[^b]\.[^b]\d{1,3}[^b]\.[^b]\d{1,3}[^b]\.[^b]\d{1,3}\b'
select-string $input_path -Pattern $regex5 -AllMatches | % { $_.Matches } | % { $_.Value } | Out-File -FilePath C:\Users\01100\Desktop\Folder\output.csv -Append
  • I'm not entirely sure what you're trying to do, but you might find it easier to use the `[URI]` type accelerator - `[URI]'http://google.co.uk'` which will convert the URI into an object with some useful properties you can use to validate the string. – Jacob Oct 31 '20 at 18:26
  • Thank you for the comment, I'll check it out – Derek Herrington Oct 31 '20 at 18:49
  • Derek, I like to try my regexes out in tools before PowerShell, JavaScript, etc., like the online RegEx tester https://regex101.com/. It color-codes each element of the RegEx. This site also has a saved library of other people's RegExs: https://regex101.com/library?orderBy=RELEVANCE&search=url. Each RegEx engine is a little different, so you may see some differences in PowerShell. There is also a PC tool I haven't tried yet called RegexBuddy https://www.regular-expressions.info/regexbuddy.html. It seems to have PowerShell RegEx engine support. – Michael Erpenbeck Oct 31 '20 at 19:30
  • Why the brackets around the period `google[.]com`? Is that what the values look like? – marsze Oct 31 '20 at 19:31
  • Michael, thank you for the link, I'll need to go through the library -- i'm sure someone has already solved this. I'll test out regexbuddy and let you know. Thank you! – Derek Herrington Oct 31 '20 at 21:34
  • Marsze, the brackets are part of the value. In my field they call it "defanging" taking what is a malicious domain and making it unclickable. Ex. reallybadmalware.cn is defanged to reallybadmalware[.]cn so word and other apps don't apply hyperlinks to it. – Derek Herrington Oct 31 '20 at 21:35
  • As for this ... [match would be if the string has a period with characters on both sides] in a URL/URI, this would always be true except for single label ones. You are not showing what your expected results should be. As noted in my answer below, you can use the .Net namespace and see all the properties, to make decisions there, but that too has its specifics of what it will return. – postanote Oct 31 '20 at 22:19
  • @DerekHerrington, regarding regexbuddy, please do let me know. The https://regex101.com/ library has become the first place I look for edge cases that I often miss. For example isolating an explicit port number in the URL, and ensuring case insensitivity. And don’t forget to give your RegEx back to the regex101 community. – Michael Erpenbeck Oct 31 '20 at 23:30
  • Firstly, for the most part, remember RegEx is RegEx, regardless of the language used. So, don't stress the PowerShell thing, as it's just a tool, like Perl, Python, et al., not really an unfortunate sort of thing. Secondly, to be a URL/URI, a period has to exist in the string, as does a protocol (http/https/ftp/ftps...) and a TLD, so looking for a '.' in the URL/URI string is kind of redundant. Also, many URLs/URIs that bad (and good) guys use are often encoded ASCII strings instead of the representation shown, so your risk inspection is missing those unless you are handling that elsewhere. – postanote Nov 02 '20 at 05:05
  • Also of note, both the PowerShell ISE and VSCode have RegEx add-ons/extensions that you can use in real time in your development effort. No need to jump out to another source unless you choose to. Yet as the old cliché goes: 'If you have a problem which you feel the need to RegEx, now you have two problems.' ;-} – postanote Nov 02 '20 at 05:08
  • @postanote Regarding RegEx is RegEx, the RegEx engines can be different depending on the underlying engine’s implementation. At a surface level you are correct that the syntax is fairly standardized and most users won’t see issues. However, recursion, named capture, and greediness can have a lot of variation. I have been bitten by each of these many times when moving RegExs from one language to another. See the following for more information: https://en.m.wikipedia.org/wiki/Comparison_of_regular-expression_engines – Michael Erpenbeck Nov 04 '20 at 02:00
  • Yeppers, they can, but in very rare cases have I run into a scenario where I could not use any RegEx I've pulled together to date and not have it work as expected across all languages I've used. In those cases, it was always the Look(Ahead/Behind/Around) stuff, well, some Java* stuff have caused me some hiccups. Yet, on the PS side (since the Monad days) of the fence, no catch 22 to date. – postanote Nov 04 '20 at 03:15

2 Answers


If I understand correctly, you want to match all lines that represent a domain name or URL? You will find that this is not a trivial matter. There are various examples of regular expressions for validating domain names or URLs (look for example here or here), but the more accurate they are required to be, the more complex they become.

In your case, it will be even more difficult, because you have different formats (sometimes with or without a scheme or query string).

How accurate your regex needs to be depends on your use case and how much work you are willing to put into it. Based on your example and your question title, I suppose you want a very basic version.

I suggest this one, it should work for the most common cases:

'^([a-z0-9-]+://)?([a-z0-9-]+\.)+[a-z0-9-]+(/.*)?$'

Short explanation:

([a-z0-9-]+://)? checks for an optional scheme at the start (no particular one)

([a-z0-9-]+\.)+[a-z0-9-]+ matches the domain incl. optional subdomains, followed by the top-level domain

(/.*)? matches an optional path/query string (without validation)

If you need more accuracy, you can use this regex as a first step to filter the input, and then perform further tests on the input strings. You could validate if it's a valid url, or check if the domain name exists.
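
To show how this could slot into the pipeline from your question, here is a minimal sketch ($input_path and the output path are taken from your script; the "[.]" to "." replace is an extra assumption so the defanged samples also match the plain-dot pattern):

$regex2 = '^([a-z0-9-]+://)?([a-z0-9-]+\.)+[a-z0-9-]+(/.*)?$'
Get-Content -Path $input_path |
    ForEach-Object { $_.Trim() -replace '\[\.\]', '.' } |   # refang "[.]" back to "." (assumption)
    Where-Object { $_ -match $regex2 } |                     # keep only lines that look like a domain/URL
    Out-File -FilePath C:\Users\01100\Desktop\Folder\output.csv -Append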

marsze

If you are using brackets as a standard for defanged URLs/URIs, then just look for those. If they are not there, then of course the URL/URI is still hot.

Clear-Host
(@'
hxxp://www[.]website[.]org

google.com

www.google[.]com

foob://geller.xyz

hxxps://website[.]net/tree/branch/etc

.foob://geller.xyz
foob://geller.xyz.
'@) -split "`n" | 
ForEach-Object {
    $Url = $PSitem
    Try {Write-Warning -Message "Defanged URL $(([regex]::Matches($Url, '.*\[\.\].*').Value))"}
    Catch {Write-Verbose "Fanged URL : $Url" -Verbose}
}
# Results
<#
WARNING: Defanged URL hxxp://www[.]website[.]org
VERBOSE: Fanged URL : 
VERBOSE: Fanged URL : google.com
VERBOSE: Fanged URL : 
WARNING: Defanged URL www.google[.]com
VERBOSE: Fanged URL : 
VERBOSE: Fanged URL : foob://geller.xyz
VERBOSE: Fanged URL : 
WARNING: Defanged URL hxxps://website[.]net/tree/branch/etc
VERBOSE: Fanged URL : 
VERBOSE: Fanged URL : .foob://geller.xyz
VERBOSE: Fanged URL : foob://geller.xyz.
#>

If you are just trying to exclude the strings with a 'period' at the beginning or end of a string, regardless of what the string is, then this is a working example of that, utilizing RegEx 'not' expression.

Clear-Host
(@'
hxxp://www[.]website[.]org

google.com

www.google[.]com

foob://geller.xyz

hxxps://website[.]net/tree/branch/etc

.foob://geller.xyz
foob://geller.xyz.
'@) -split "`n" | 
ForEach-Object {
Try{([regex]::Matches($PSitem, '^((?!(((^\..*)|(.*\.$)))).)*$')).Value}
Catch{}
}
# Results
<#
hxxp://www[.]website[.]org
google.com
www.google[.]com
foob://geller.xyz
hxxps://website[.]net/tree/branch/etc
#>

...or doing the reverse, then:

Clear-Host
(@'
hxxp://www[.]website[.]org

google.com

www.google[.]com

foob://geller.xyz

hxxps://website[.]net/tree/branch/etc

.foob://geller.xyz
foob://geller.xyz.
'@) -split "`n" | 
ForEach-Object {
Try{([regex]::Matches($PSitem, '^((^\..*|(.*\.$)))')).Value}
Catch{}
}
# Results
<#
.foob://geller.xyz
foob://geller.xyz.
#>

Or skip all the RegEx altogether and just do this:

Clear-Host
(@'
hxxp://www[.]website[.]org

google.com

www.google[.]com

foob://geller.xyz

hxxps://website[.]net/tree/branch/etc

.foob://geller.xyz
foob://geller.xyz.
'@) -split "`n" | 
ForEach-Object {
    If($PSItem[0] -eq '.' -or $PSItem[-1] -eq '.'){}
    Else {$PSItem}
}
# Results
<#
hxxp://www[.]website[.]org

google.com

www.google[.]com

foob://geller.xyz

hxxps://website[.]net/tree/branch/etc

 

#>

You could also look at using the .NET namespace to see what is being returned and use the properties to make your decisions.

Clear-Host
@'
hxxp://www[.]website[.]org
google.com
www.google[.]com
foob://geller.xyz
hxxps://website[.]net/tree/branch/etc
'@ -split "`n" | 
ForEach {
    Try {($PSItem.trim() -as [System.URI])}
    Catch {$PSItem.Exception.Message}
}
# Results
<#
AbsolutePath   : 
AbsoluteUri    : 
LocalPath      : 
Authority      : 
HostNameType   : 
IsDefaultPort  : 
IsFile         : 
IsLoopback     : 
PathAndQuery   : 
Segments       : 
IsUnc          : 
Host           : 
Port           : 
Query          : 
Fragment       : 
Scheme         : 
OriginalString : google.com
DnsSafeHost    : 
IdnHost        : 
IsAbsoluteUri  : False
UserEscaped    : False
UserInfo       : 

AbsolutePath   : 
AbsoluteUri    : 
LocalPath      : 
Authority      : 
HostNameType   : 
IsDefaultPort  : 
IsFile         : 
IsLoopback     : 
PathAndQuery   : 
Segments       : 
IsUnc          : 
Host           : 
Port           : 
Query          : 
Fragment       : 
Scheme         : 
OriginalString : www.google[.]com
DnsSafeHost    : 
IdnHost        : 
IsAbsoluteUri  : False
UserEscaped    : False
UserInfo       : 

AbsolutePath   : /
AbsoluteUri    : foob://geller.xyz/
LocalPath      : /
Authority      : geller.xyz
HostNameType   : Dns
IsDefaultPort  : True
IsFile         : False
IsLoopback     : False
PathAndQuery   : /
Segments       : {/}
IsUnc          : False
Host           : geller.xyz
Port           : -1
Query          : 
Fragment       : 
Scheme         : foob
OriginalString : foob://geller.xyz
DnsSafeHost    : geller.xyz
IdnHost        : geller.xyz
IsAbsoluteUri  : True
UserEscaped    : False
UserInfo       : 
#>
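
For example, here is a rough sketch of that decision step, using the properties shown above (the "[.]" to "." refang replace is my own assumption so the defanged strings will also cast):

Clear-Host
@'
hxxp://www[.]website[.]org
google.com
www.google[.]com
foob://geller.xyz
hxxps://website[.]net/tree/branch/etc
'@ -split "`n" |
ForEach-Object {
    $Refanged = $PSItem.Trim() -replace '\[\.\]', '.'   # refang before casting (assumption)
    $Uri = $Refanged -as [System.Uri]                    # $null if the conversion fails
    If ($Uri -and $Uri.IsAbsoluteUri) {
        # e.g. 'foob://geller.xyz -> host: geller.xyz scheme: foob'
        '{0} -> host: {1} scheme: {2}' -f $PSItem.Trim(), $Uri.Host, $Uri.Scheme
    }
    ElseIf ($Uri) {
        '{0} -> no scheme, treat as a bare domain' -f $PSItem.Trim()
    }
}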
postanote