parsing multiple lines with regex

Question

I'm writing a program in Java that parses a BibTeX library file. Each entry should be parsed into fields and values. This is an example of a single BibTeX entry from a library:

@INPROCEEDINGS{conf/icsm/Ceccato07,
  author = {Mariano Ceccato},
  title = {Migrating Object Oriented code to Aspect Oriented Programming},
  booktitle = {ICSM},
  year = {2007},
  pages = {497--498},
  publisher = {IEEE},
  bibdate = {2008-11-18},
  bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/icsm/icsm2007.html#Ceccato07},
  crossref = {conf/icsm/2007},
  owner = {Administrator},
  timestamp = {2009.04.30},
  url = {http://dx.doi.org/10.1109/ICSM.2007.4362668}
}

In this case, I just read each line and split it using the split method. For example, the first field (author) is parsed like this:

import java.io.File;
import java.util.Scanner;

Scanner in = new Scanner(new File("library.bib"));    //open the library file
in.nextLine();                                        //skip the header
String input = in.nextLine();                         //read (author = {Mariano Ceccato},)
String field = input.split("=")[0].trim();            //field = "author"
String value = input.split("=")[1];                   //value = " {Mariano Ceccato},"
value = value.split("\\}")[0];                        //value = " {Mariano Ceccato"
value = value.split("\\{")[1];                        //value = "Mariano Ceccato"
value = value.trim();                                 //remove surrounding white space (if any)

Up to now everything is fine. However, some BibTeX entries in the library have values that span multiple lines:

@ARTICLE{Aksit94AbstractingCF,
  author = {Mehmet Aksit and Ken Wakita and Jan Bosch and Lodewijk Bergmans and
  Akinori Yonezawa },
  title = {{Abstracting Object Interactions Using Composition Filters}},
  journal = {Lecture Notes in Computer Science},
  year = {1994},
  volume = {791},
  pages = {152--??},
  acknowledgement = {Nelson H. F. Beebe, Center for Scientific Computing, University of
  Utah, Department of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt
  Lake City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1 801
  581 4148, e-mail: \path|beebe@math.utah.edu|, \path|beebe@acm.org|,
  \path|beebe@computer.org|, \path|beebe@ieee.org| (Internet), URL:
  \path|http://www.math.utah.edu/~beebe/|},
  bibdate = {Mon May 13 11:52:14 MDT 1996},
  coden = {LNCSD9},
  issn = {0302-9743},
  owner = {aljasser},
  timestamp = {2009.01.08}
}

As you can see, the acknowledgement field spans more than one line, so I can't read it with a single nextLine() call. My parsing function works fine if the whole field is passed to it as one String. So what is the best way to read this entry and other multi-line entries, and still be able to read single-line entries?

score 0 · Answer 1 · answered Sep 13 '14 at 14:02

For this kind of issue, it is always better to use a dedicated parser. I googled for a BibTeX parser and found this.

If you would rather write your own, as you are doing, one solution to this problem is to check whether the line ends with } (optionally followed by a comma); if not, append the next line to the current one.

Having said that, there might be other issues, which is why I suggested using a parser.

answered Sep 13 '14 at 14:02

EurikaIam

This isn't an option; I've been asked to write it myself. I found that all but a very few (maybe even just one) of them end with this regex: "\},{0,1}\r\n". Thanks anyway. – fcm2009 Sep 13 '14 at 14:22
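
As a rough illustration of the regex idea in the comment above, the same "closing brace, optional comma, line break" ending can be matched against a whole entry held in one String. This is only a sketch: the class name RegexFieldParser, the method parseFields, and the exact pattern are assumptions made for the example, not something given in the thread. It will cut a value short if a line happens to break right after an inner closing brace.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answers.
class RegexFieldParser {

    // name = {value}, then an optional comma, then a line break or end of input.
    // DOTALL lets '.' cross line breaks, so a value may span several lines.
    // "\\r?\\n" accepts both Windows and Unix line endings.
    private static final Pattern FIELD = Pattern.compile(
            "(\\w+)\\s*=\\s*\\{(.*?)\\}\\s*,?\\s*(?:\\r?\\n|$)", Pattern.DOTALL);

    // Returns the fields of one entry in the order they appear.
    static Map<String, String> parseFields(String entry) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(entry);
        while (m.find()) {
            // Collapse line breaks and indentation inside the value to single spaces.
            String value = m.group(2).replaceAll("\\s+", " ").trim();
            fields.put(m.group(1), value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String entry = String.join("\n",
                "@ARTICLE{Aksit94AbstractingCF,",
                "  author = {Mehmet Aksit and Ken Wakita and Jan Bosch and Lodewijk Bergmans and",
                "  Akinori Yonezawa },",
                "  title = {{Abstracting Object Interactions Using Composition Filters}},",
                "  year = {1994}",
                "}");
        parseFields(entry).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Only the outermost pair of braces is stripped, so a doubly braced title keeps its inner braces.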

score 0 · Accepted Answer · answered Sep 19 '14 at 09:38

The form of these entries is

@<type>{<Id>,
<name>={<value>},
....
<name>={<value>}
}

Note that the last name-value pair is not followed by a comma.

If a value is split over several lines, that simply means the current line does not yet contain the closing brace. In that case, read the next line and append it to the string you are about to split. Keep doing this until the last characters in the string are "}," or "}" (the latter happens when the multi-line field, such as 'acknowledgement', is the last name-value pair in the record).
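
A minimal sketch of that line-joining loop, assuming the entry is read through a Scanner as in the question. The class name BibFieldJoiner and the method readFields are made up for the example, and the entry header and the lone closing "}" are simply skipped rather than parsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of the original question or answers.
class BibFieldJoiner {

    // Joins physical lines into logical "name = {value}" fields.
    // A field counts as complete once the accumulated text ends with "}" or "},".
    static List<String> readFields(Scanner in) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                continue;
            }
            // Outside a field, skip the "@TYPE{key," header and the entry's closing "}".
            if (current.length() == 0 && (line.startsWith("@") || line.equals("}"))) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' '); // a line break inside a value becomes one space
            }
            current.append(line);
            String soFar = current.toString();
            if (soFar.endsWith("}") || soFar.endsWith("},")) {
                fields.add(soFar);
                current.setLength(0);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String entry = String.join("\n",
                "@ARTICLE{Aksit94AbstractingCF,",
                "  author = {Mehmet Aksit and Ken Wakita and Jan Bosch and Lodewijk Bergmans and",
                "  Akinori Yonezawa },",
                "  year = {1994}",
                "}");
        for (String field : readFields(new Scanner(entry))) {
            System.out.println(field);
        }
    }
}
```

Each returned string is a complete single-line field, so it can be handed to the existing split-based parsing code unchanged.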

For extra safety, check that the number of closing braces matches the number of opening braces, and keep appending lines to your string until it does. This covers situations where a long value, such as an article title, happens to break at an unfortunate place, for example:

title = {{Abstracting Object Interactions Using Composition Filters, and other stuff}
},
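
The brace-matching check can be sketched along the following lines. The names BalancedFieldReader, braceDepth and readFields are invented for the example, and the sketch assumes braces are never escaped (for instance as \{) inside a value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of the original question or answers.
class BalancedFieldReader {

    // Net brace depth of a string: +1 for each '{', -1 for each '}'.
    static int braceDepth(String s) {
        int depth = 0;
        for (char c : s.toCharArray()) {
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
            }
        }
        return depth;
    }

    // Keeps appending lines to the current field until its braces balance.
    static List<String> readFields(Scanner in) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                continue;
            }
            // Outside a field, skip the "@TYPE{key," header and the entry's closing "}".
            if (current.length() == 0 && (line.startsWith("@") || line.equals("}"))) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line);
            depth += braceDepth(line);
            if (depth == 0) { // every '{' opened in this field has been closed
                fields.add(current.toString());
                current.setLength(0);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String entry = String.join("\n",
                "@ARTICLE{Aksit94AbstractingCF,",
                "  title = {{Abstracting Object Interactions Using Composition Filters, and other stuff}",
                "  },",
                "  year = {1994}",
                "}");
        for (String field : readFields(new Scanner(entry))) {
            System.out.println(field);
        }
    }
}
```

With the title broken as in the example above, the first line leaves one brace open, so the "}," line is appended before the field is considered complete.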
