
Let's say I have a HUGE file containing a bunch of code. Every function in this code is documented in XML format, and each piece of documentation is enclosed in 'documentation comments' (/** and **/). I want a regular expression that removes everything that is not between documentation comments. It's fine if the comment delimiters themselves end up in the output; I can remove them afterwards if needed.

Example of part of the script:

/**--------------------------------------------------------------------------**\
<summary>FunctionName</summary>
<returns>
    Returns 1 on success.
    Returns 0 on failure.
</returns>
<remarks>
    This function is a function.
</remarks>
\**--------------------------------------------------------------------------**/

int FunctionName()
{
    int X = 1;
    if(X == 1)
        return 1;
    return 0;
}

Expected output:

<summary>FunctionName</summary>
<returns>
    Returns 1 on success.
    Returns 0 on failure.
</returns>
<remarks>
    This function is a function.
</remarks>
Mogsdad
Corey Iles
    So you want to create a document consisting of the documentation comments alone? Don't think of it as *removing* or *excluding* the parts you don't want, just match the parts you **do** want and write them to a new file. – Alan Moore Nov 07 '15 at 20:36

1 Answer

You can use this pattern:

/^\/(\*\*-+\*\*)\\$(.*?)^\\\1\/$|./gsm

and replace with $2.

Working example: https://regex101.com/r/fA8bP0/1

The trick is basically the same as in "Regex Pattern to Match, Excluding when… / Except between": use alternation to match what we need with one branch, and let the other branch skip over everything we don't want.
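For anyone running this outside regex101, a minimal Python sketch of the same replacement might look like the following. The names DOC_OR_ANY and keep_only_docs are made up for illustration, and the sample is a shortened version of the block from the question. Python spells the /s and /m flags as re.S and re.M, and has no /g flag because re.sub replaces every match by default.

```python
import re

# The answer's pattern with Python flags:
# re.S lets "." match newlines, re.M makes ^ and $ match at line boundaries.
DOC_OR_ANY = re.compile(
    r"^/(\*\*-+\*\*)\\$(.*?)^\\\1/$|.",
    re.S | re.M,
)


def keep_only_docs(source):
    # Group 2 is set only when the first alternative (a whole doc block)
    # matched. For the single-character alternative it is None, so that
    # character is replaced with nothing.
    return DOC_OR_ANY.sub(lambda m: m.group(2) or "", source)


# Shortened sample based on the question (hypothetical test data).
sample = "\n".join([
    "/**" + "-" * 74 + "**\\",
    "<summary>FunctionName</summary>",
    "<returns>",
    "    Returns 1 on success.",
    "</returns>",
    "\\**" + "-" * 74 + "**/",
    "",
    "int FunctionName()",
    "{",
    "    return 1;",
    "}",
    "",
])

docs_only = keep_only_docs(sample)
print(docs_only.strip())
```

The captured text keeps the line breaks next to the delimiter lines, which is why the sketch strips the result before printing.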

Some notes about the pattern:

  • ^ and $ are not strictly needed - it depends on whether the comments are on a whole line. You can remove them, and remove the /m (multiline) flag.
  • \/(\*\*-+\*\*)\\ matches a whole line of a comment, /**-------**\.
  • We assume there is the same number of hyphens at the beginning of the block as at the end, and capture the delimiter (asterisks and hyphens) to \1. If this is not correct, use \*\*-+\*\* again instead of \1. If you have a fixed number of hyphens, you can use -{74}.
  • The interesting content is captured to $2, which is why we replace with $2.
  • Everything else is matched one character at a time by the ., where $2 is empty, so it is replaced away.
  • Caveat: this pattern may fail in the usual ways - strings that contain "/**-", commented code that looks like documentation, escaped characters, etc.
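If the goal is only to collect the documentation, as the comment under the question suggests, the alternation can be dropped and the blocks matched directly. The sketch below is one way to do that in Python, not part of the original answer; DOC_BLOCK and extract_docs are hypothetical names, and it uses the \*\*-+\*\* variant from the notes above instead of \1, so the opening and closing lines may have different numbers of hyphens.

```python
import re

# Same delimiters as the answer's pattern, minus the "|." alternative,
# because nothing has to be replaced away here.
DOC_BLOCK = re.compile(
    r"^/\*\*-+\*\*\\$\n(.*?)^\\\*\*-+\*\*/$",
    re.S | re.M,
)


def extract_docs(source):
    # Return the text between each pair of delimiter lines, in file order.
    return [m.group(1) for m in DOC_BLOCK.finditer(source)]


# Two small blocks with code in between (hypothetical test data).
sample = "\n".join([
    "/**------**\\",
    "<summary>First</summary>",
    "\\**------**/",
    "int First() { return 1; }",
    "/**----**\\",
    "<summary>Second</summary>",
    "\\**--------**/",
    "int Second() { return 2; }",
    "",
])

blocks = extract_docs(sample)
print("".join(blocks), end="")
```

Joining the returned list and writing it to a new file gives the same kind of output as the replacement approach, without visiting every unwanted character as a separate match. The same caveats about look-alike text in strings or commented-out code still apply.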
Kobi