
I need help with a regular expression. I have thousands of lines in a file in the following format:

 + + [COMPILED]\SRC\FileCheck.cs                               - TotalLine:   99 RealLine:   27 Braces:   18 Comment:   49 Empty:    5
 + + [COMPILED]\SRC\FindstringinFile.cpp                                  - TotalLine:  103 RealLine:   26 Braces:   22 Comment:   50 Empty:    5
 + + [COMPILED]\SRC\findingstring.js                                - TotalLine:   91 RealLine:   22 Braces:   14 Comment:   48 Empty:    7
 + + [COMPILED]\SRC\restinpeace.h                      - TotalLine:   95 RealLine:   24 Braces:   16 Comment:   48 Empty:    7
 + + [COMPILED]\SRC\Getsomething.h++                               - TotalLine:  168 RealLine:   62 Braces:   34 Comment:   51 Empty:   21
 + + [COMPILED]\SRC\MemDataStream.hh                             - TotalLine:  336 RealLine:  131 Braces:   82 Comment:   72 Empty:   51
 + + [CONTEXT]\SRC\MemDataStream.sql                             - TotalLine:  36 RealLine:  138 Braces:   80 Comment:   76 Empty:   59

I need a regular expression that can give me:

  • FilePath, e.g. \SRC\FileMap.cpp
  • Extension, e.g. .cpp
  • RealLine value, e.g. 17

I'm using PowerShell to implement this and have been successful in getting the results back using the Get-Content (to read the file) and Select-String cmdlets. The problem is that it takes a long time to extract the various substrings and then write them to the XML file. (I have not included the code for generating the XML.) I've never used regular expressions before, but I understand that a regular expression would be an efficient way to get the strings.

Help would be appreciated.

The Select-String cmdlet accepts a regular expression as its search pattern.

Current code is as follows:

    function Get-SubString
    {
        Param ([string]$StringtoSearch, [string]$StartOfTheString, [string]$EndOfTheString)

        # Return nothing if the start marker is not present
        [int]$StartPosition = $StringtoSearch.IndexOf($StartOfTheString)
        if ($StartPosition -eq -1)
        {
            return
        }

        [int]$StartOfIndex = $StartPosition + $StartOfTheString.Length
        [int]$EndOfIndex = $StringtoSearch.IndexOf($EndOfTheString, $StartOfIndex)
        if ($EndOfIndex -eq -1)
        {
            # No end marker after the start marker: take the rest of the string
            [string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex)
        }
        else
        {
            [string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex, $EndOfIndex - $StartOfIndex)
        }
        return $ExtractedString
    }

    function Get-FileExtension
    {
        Param ([string]$Path)
        [System.IO.Path]::GetExtension($Path)
    }

    # For each file extension we will be searching all lines starting with + +
    $SearchIndividualLines = "+ + ["
    $TotalLines = Select-String -Pattern $SearchIndividualLines -Path $StandardOutputFilePath -AllMatches -SimpleMatch

    for ($i = $TotalLines.GetLowerBound(0); $i -le $TotalLines.GetUpperBound(0); $i++)
    {
        $FileDetailsString = $TotalLines[$i]

        # Get file path
        $StartStringForFilePath = "]"
        $EndStringforFilePath = "- TotalLine"
        $FilePathValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForFilePath -EndOfTheString $EndStringforFilePath
        #Write-Host FilePathValue is $FilePathValue

        # Get file extension
        $FileExtensionValue = Get-FileExtension -Path $FilePathValue
        #Write-Host FileExtensionValue is $FileExtensionValue

        # Get RealLine
        $StartStringForRealLine = "RealLine:"
        $EndStringforRealLine = "Braces"
        $RealLineValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForRealLine -EndOfTheString $EndStringforRealLine
        if ([string]::IsNullOrEmpty($RealLineValue))
        {
            continue
        }
    }
a6k006
    Possible duplicate of [Reference - What does this regex mean?](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – briantist Nov 10 '15 at 00:33

1 Answer


Assume you have those lines in C:\temp\sample.txt.

Something like this?

PS> (get-content C:\temp\sample.txt) | % { if ($_ -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }

FilePath              Extention RealLine
--------              --------- --------
\SRC\FileCheck        .cs       27      
\SRC\FindstringinFile .cpp      26      
\SRC\findingstring    .js       22      
\SRC\restinpeace      .h        24      
\SRC\Getsomething     .h        62      
\SRC\MemDataStream    .hh       131
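
A variant of the same idea, as a minimal sketch rather than a definitive pattern: named groups make the results easier to read, and `[^\s.]+` for the extension also accepts non-word characters, so `.h++` is captured whole. Unlike the one-liner above, this version keeps the extension inside FilePath (as in the question's example) and accepts any bracketed tag, which is an assumption. It also assumes paths contain no spaces. The names `$lines`, `$pattern`, and `$rows` are illustrative, and the sample lines are inlined so the snippet runs on its own.

```powershell
# Sample lines inlined for illustration; in practice they would come from Get-Content.
$lines = @(
    ' + + [COMPILED]\SRC\FileCheck.cs                               - TotalLine:   99 RealLine:   27 Braces:   18 Comment:   49 Empty:    5'
    ' + + [COMPILED]\SRC\Getsomething.h++                               - TotalLine:  168 RealLine:   62 Braces:   34 Comment:   51 Empty:   21'
    ' + + [CONTEXT]\SRC\MemDataStream.sql                             - TotalLine:  36 RealLine:  138 Braces:   80 Comment:   76 Empty:   59'
)

# \[\w+\]            any bracketed tag such as [COMPILED] or [CONTEXT] (assumption: all tags are wanted)
# (?<FilePath>...)   the path up to the first whitespace (assumption: no spaces in paths)
# (?<Extension>...)  the last dot in the path plus the non-space, non-dot characters after it
# (?<RealLine>\d+)   the digits following "RealLine:"
$pattern = '\[\w+\](?<FilePath>\S+(?<Extension>\.[^\s.]+))\s+-\s+TotalLine:.*?RealLine:\s*(?<RealLine>\d+)'

$rows = foreach ($line in $lines) {
    if ($line -match $pattern) {
        [PSCustomObject]@{
            FilePath  = $Matches.FilePath
            Extension = $Matches.Extension
            RealLine  = [int]$Matches.RealLine
        }
    }
}
$rows
```

With these three sample lines, the second row's Extension comes out as `.h++` and the third row (the [CONTEXT] line) is included as well.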

Update: Anything inside parentheses is captured, so if you want to capture [COMPILED] as well, you just need to move that part inside the parentheses:

Instead of

$_ -match '.*COMPILED\](\\.*) 

use

$_ -match '.*(\[COMPILED\]\\.*)

The link in the comment on your question includes a good primer on regex.

Update 2: Now that you want to capture the full path, I am guessing your sample looks like this:

+ + [COMPILED]C:\project\Rom\Main\Plan\file1.file2.file3\Cmd\Camera.culture.less-Late-PP.min.js    - TotalLine:  336 RealLine:  131 Braces:   82 Comment:   72 Empty:   51

The technique above will still work; you just need a slight adjustment to the first pair of parentheses, like this:

$_ -match '(\[COMPILED\].*)'

This tells the regex to capture [COMPILED] and everything that comes after it, up to

(\.\w+)

i.e. up to the extension, which is a dot followed by one or more word characters (letters, digits, or underscores). This will not capture an extension in full if it contains other characters, such as .h++, which is captured as .h.
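
A quick check of what `(\.\w+)` accepts, as a small sketch: `\w` covers letters, digits, and underscores, so an extension such as `.3gp` is matched whole, while `.h++` is cut to `.h` because `+` is not a word character. The file name `clip.3gp` and the variable names are hypothetical, used only for illustration.

```powershell
# \w matches letters, digits, and underscores, so an extension with a digit is captured whole.
$null = 'clip.3gp' -match '(\.\w+)'
$digitExtension = $Matches[1]    # .3gp

# '+' is not a word character, so the capture stops before it.
$null = 'Getsomething.h++' -match '(\.\w+)'
$plusExtension = $Matches[1]     # .h

$digitExtension, $plusExtension
```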

So your original one-liner would instead be:

(get-content C:\temp\sample.txt) | % { if ($_ -match '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
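
Since the question mentions that processing thousands of lines is slow, here is a sketch of the same pattern driven by `switch -Regex -File`, which reads the file line by line and fills `$Matches` for each matching line. It is commonly reported to be faster than piping Get-Content through ForEach-Object, though that is worth measuring on your own data. The temporary file stands in for C:\temp\sample.txt, and `$path`, `$pattern`, and `$results` are illustrative names.

```powershell
# Hypothetical temp file standing in for C:\temp\sample.txt.
$path = [System.IO.Path]::GetTempFileName()
Set-Content -Path $path -Value @(
    ' + + [COMPILED]\SRC\FileCheck.cs      - TotalLine:   99 RealLine:   27 Braces:   18 Comment:   49 Empty:    5'
    ' + + [COMPILED]\SRC\MemDataStream.hh  - TotalLine:  336 RealLine:  131 Braces:   82 Comment:   72 Empty:   51'
    'a line that does not match'
)

# Same pattern as in the one-liner above.
$pattern = '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*'

# Each line of the file is tested against $pattern; non-matching lines are skipped.
$results = switch -Regex -File $path {
    $pattern {
        [PSCustomObject]@{
            FilePath  = $Matches[1]
            Extension = $Matches[2]
            RealLine  = $Matches[3]
        }
    }
}

Remove-Item -Path $path
$results
```

Collecting the objects in `$results` first also means the XML can be written in one pass afterwards instead of line by line.
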
Adil Hindistan