How can I get this regex right in C#?

Question

I am trying to match every block that has type: "Data" in it and replace it with text of my own.
A sample input is given below; there can be one or more of these blocks:

layer {
  name: "cifar"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "examples/cifar10/mean.binaryproto"
    mirror: true
    #crop_size: 20 
  }

# this is a comment!
  data_param {
    source: "examples/cifar10/cifar10_train_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "cifar"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mean_file: "examples/cifar10/mean.binaryproto"
  }
  data_param {
    source: "examples/cifar10/cifar10_test_lmdb"
    batch_size: 25
    backend: LMDB
  }
}

I came up with this regex:

((layer)( *)((\n))*{((.*?)(\n)*)*(type)( *):( *)("Data")((.*?)(\n)*)*)(.*?)(\n)}

I tried to model this:

Find and select a block starting with layer.
There can be any number of space characters after it, but then
there should be a { character.
Then there can be anything (to keep it simple), and then
there should be type followed by any number of spaces, a colon, more spaces, and then "Data".
Then anything can follow, until a } character is reached.

But clearly this does not work properly. If I change the type in any of these layer blocks, nothing is matched at all, not even the layer that still has type: "Data".

@Steve: It’s not. The keys (and some values) aren’t quoted and there are no commas, plus `#` might be being used for a comment. — Ry-, Feb 05 '17 at 09:33
In the example, `type` only appears in `layer`s, and always following `name` and preceding `top`. If this is always true, you could simplify your regex significantly. Otherwise, what data structure does this model? Does it have a tokenizer/parser/AST you can use to read the data more reliably? Or can you convert it to JSON/XML first? — Orphid, Feb 05 '17 at 09:50
What identifies the blocks that you don't want to match - do they consistently contain something that you can hook onto e.g. `type: "Foo"` and then you exclude all those blocks to be left with just the other ones which will then be the ones with `type: "Data"` ? — Robin Mackenzie, Feb 05 '17 at 10:47
@Orphid: No, not necessarily; they can appear in any order. This is actually a format that the Caffe framework uses internally. I want to automate some processes: remove these sections, replace them with another section, then save the new file and feed it to Caffe for further processing. — Rika, Feb 05 '17 at 10:56
@RobinMackenzie: Yes, they all contain a type field. This is a complete example: http://pastebin.com/Z9EhUfMA . As you can see, there are several kinds of layers, which can be identified by their types, but I guess looking for the Data layers and removing/replacing them would be easier and more hassle-free. — Rika, Feb 05 '17 at 11:03

score 1 · Accepted Answer · edited May 23 '17 at 11:53

Based on this post about using .NET regular expressions to do bracket matching, you can adapt the regex presented there:

\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)

It looks for sets of matching ( and ), and you can simply swap those for { and } (noting that the parentheses are escaped in that regex).

Then you can prefix it with layer\s*.
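A minimal C# sketch of the pattern built so far (the brace-matching regex with layer\s* prefixed) might look like the following. The class name LayerBlocks, the method FindAll and the sample text are made up for illustration, and the braces are escaped only for readability. At this stage it matches every layer block regardless of its type, and it assumes braces never appear inside comments or quoted strings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LayerBlocks
{
    // "layer", optional whitespace, then a balanced { ... } block.
    //   \{(?<c>)    an inner opening brace pushes onto the stack "c"
    //   [^{}]+      any run of non-brace characters
    //   \}(?<-c>)   an inner closing brace pops the stack (fails if it is empty)
    //   (?(c)(?!))  fail if anything is still on the stack
    public const string Pattern =
        @"layer\s*\{(?>\{(?<c>)|[^{}]+|\}(?<-c>))*(?(c)(?!))\}";

    // Hypothetical helper: returns every balanced layer block in the text.
    public static MatchCollection FindAll(string text) => Regex.Matches(text, Pattern);

    public static void Main()
    {
        // Illustrative input, shortened from the sample in the question.
        string sample =
            "layer {\n  type: \"Data\"\n  include {\n    phase: TRAIN\n  }\n}\n" +
            "layer {\n  type: \"ReLU\"\n}\n";

        foreach (Match m in FindAll(sample))
            Console.WriteLine($"Block at index {m.Index}, length {m.Length}");
    }
}
```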

To exclude blocks where type <> "Data", I've added a negative lookahead for all the other type keywords in your sample in the pastebin. Unfortunately, adding a positive lookahead for type: "Data" simply didn't work; if it had, I think that would be your most robust solution.

Hopefully you have a finite list of type values and you can extend this for a practical solution:

layer\s*{(?>{(?<c>)|[^{}](?!type: "Accuracy"|type: "Convolution"|type: "Dropout"|type: "InnerProduct"|type: "LRN"|type: "Pooling"|type: "ReLU"|type: "SoftmaxWithLoss")+|}(?<-c>))*(?(c)(?!))}

The key bit to work with in the original regex is [^()]+, which matches the content between the brackets that the other components of the regex are matching. I've adapted that to [^{}]+ ('everything other than the braces') and then added the long 'apart from' clause with the keywords not to match.
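If the list of other type values is not finite, one alternative (a swapped-in technique, not the regex above) is to match every balanced layer block and decide in code whether to replace it. The sketch below does this with Regex.Replace and a MatchEvaluator. The names DataLayerReplacer and ReplaceDataLayers, and the replacement text with type "Input", are hypothetical. It assumes each field sits on its own line, as in the sample, and that braces never occur in comments or strings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DataLayerReplacer
{
    // Any "layer { ... }" block with balanced nested braces
    // (the same balancing-group idea as the answer's regex).
    private static readonly Regex LayerBlock = new Regex(
        @"layer\s*\{(?>\{(?<c>)|[^{}]+|\}(?<-c>))*(?(c)(?!))\}");

    // A line that starts with type: "Data". With RegexOptions.Multiline, ^ matches
    // at the start of every line, so a commented-out "#type: ..." line is ignored.
    private static readonly Regex DataType = new Regex(
        @"^[ \t]*type\s*:\s*""Data""", RegexOptions.Multiline);

    // Replaces only the blocks that contain a type: "Data" line; others are kept as-is.
    public static string ReplaceDataLayers(string text, string replacement) =>
        LayerBlock.Replace(text, m => DataType.IsMatch(m.Value) ? replacement : m.Value);

    public static void Main()
    {
        // Illustrative input and replacement; neither comes from the original post.
        string input =
            "layer {\n  name: \"cifar\"\n  type: \"Data\"\n  include {\n    phase: TRAIN\n  }\n}\n" +
            "layer {\n  name: \"relu1\"\n  type: \"ReLU\"\n}\n";
        string replacement = "layer {\n  name: \"input\"\n  type: \"Input\"\n}";

        Console.WriteLine(ReplaceDataLayers(input, replacement));
    }
}
```

Doing the type check in the evaluator keeps the brace-matching pattern unchanged, at the cost of scanning each matched block a second time.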
