
Is there a way to write a rule-based system to catch things like start/end dates in contract text? Here are a few real examples. I am bolding the date entities which I want spaCy to detect automatically. If you have ideas other than spaCy, that is also OK!

  1. The initial term of this Lease shall be for a period of Five (5) years commencing on February 1, 2012, (the “Lease Commencement Date”) and expiring on January 31, 2017 (the “Initial Lease Term”).

  2. Term: One (1) year commencing January 1, 2007 ("Commencement Date") and ending December 31, 2007 ("Expiration Date").

  3. This Lease Agreement is entered into for term of 15 years, beginning January 1, 2014 and ending on December 31, 2028.

yishairasowsky
  • Dates can be super complicated. Can you be certain that you will only be looking for dates in the format `MonthName dayNum, 4DigitYear`? – user1558604 Dec 15 '19 at 13:31
  • No guarantee what format it will be in. Could be MONTH, DAY, YEAR, or MM/DD/YYYY for example. – yishairasowsky Dec 15 '19 at 14:11
  • That makes it more difficult. Could it also be DD/MM/YYYY or DD/MM/YY, or YYYY/MM/DD, or YY/MM/DD? This is why dates are complicated in programming. – user1558604 Dec 15 '19 at 14:14
  • oh, i actually wasn't worried about this detail, because one could just submit the date to dateutil.parser and see if it is recognized... – yishairasowsky Dec 15 '19 at 14:37
  • But you still need to recognize it as a date. You can't do that without knowing all the formats that a date could be in. – user1558604 Dec 15 '19 at 14:38
  • i know spacy can recognize it as a date. i just want to subselect for those dates which are start/end dates. – yishairasowsky Dec 16 '19 at 10:58

2 Answers


I think you have to make a clear distinction between two types of methods:

1) Statistical models / Machine Learning, a.k.a. NER models. These take the context of the sentence into account when deciding whether a specific token, or a run of consecutive tokens, is a date. spaCy has pre-built NER models you can download and try out on your specific data. You'll want to look for the entities in `doc.ents` that have `ent.label_ == "DATE"`. Once you have those entities, you can run them through a date parser to work out what the actual date is. See also here for more information.

2) Rule-based entity recognition. Here, you have to define the rules yourself by specifying what you expect a date to look like, e.g. XX/XX/XXXX with X being a digit. As user1558604 pointed out though, you'll have to write several different rules if you want to recognize different representations of dates. You can find an overview of spaCy's rule-based matching methods here.
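
To make the rule-based option concrete, here is a minimal sketch in plain Python (standard library only, no spaCy). It assumes just two date formats, `MonthName D, YYYY` and `MM/DD/YYYY`, and a small hand-picked list of cue words; the names `label_dates`, `START_CUES` and `END_CUES` are made up for illustration and are not part of any library. If you already get date spans from spaCy's NER, the same cue-word lookup could be applied to the text to the left of each `DATE` entity instead of to regex matches.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Rule 1: "February 1, 2012"    Rule 2: "02/01/2012" (ASSUMED to be MM/DD/YYYY)
DATE_RE = re.compile(
    r"\b(?P<month>" + "|".join(MONTH_NAMES) + r")\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})\b"
    r"|\b(?P<m>\d{1,2})/(?P<d>\d{1,2})/(?P<y>\d{4})\b"
)

# Hypothetical cue words; extend these after looking at more contracts.
START_CUES = ("commencing", "beginning", "starting")
END_CUES = ("expiring", "ending", "terminating")
CUE_RE = re.compile(r"\b(?:" + "|".join(START_CUES + END_CUES) + r")\b", re.IGNORECASE)


def to_date(match):
    """Turn a DATE_RE match into a date, or None if it is not a real calendar date."""
    try:
        if match.group("month"):
            month = MONTH_NAMES.index(match.group("month")) + 1
            return date(int(match.group("year")), month, int(match.group("day")))
        return date(int(match.group("y")), int(match.group("m")), int(match.group("d")))
    except ValueError:
        return None


def label_dates(text, window=40):
    """Return (role, date) pairs, where role is "start", "end" or None.

    The role comes from the last cue word found in the `window` characters
    to the left of the date, without reaching back past the previous date.
    """
    results = []
    prev_end = 0
    for match in DATE_RE.finditer(text):
        parsed = to_date(match)
        if parsed is None:
            continue  # looked like a date but is not one, e.g. 02/30/2012
        left = text[max(prev_end, match.start() - window):match.start()]
        cues = CUE_RE.findall(left)
        role = None
        if cues:
            role = "start" if cues[-1].lower() in START_CUES else "end"
        results.append((role, parsed))
        prev_end = match.end()
    return results
```

Calling `label_dates` on the second example from the question should tag January 1, 2007 as `"start"` and December 31, 2007 as `"end"`. Dates with no cue word nearby come back with a role of `None`, which makes it easy to see which cases the rules do not cover yet.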

Sofie VL
  • Thanks! Right now, we have a set of rules that select the start and end dates from all the spacy recognized dates. We want to make a more sophisticated rule-based approach before going to machine learning though. A few reasons for this: 1) we will establish a baseline accuracy/recall threshold to which we can compare future statistical models; 2) we will discover more about the problem and better understand its subtleties; 3) we can use the rule-based approach to help efficiently label data for future training. Maybe we should use the parsing tool? Love to hear your thoughts. Thanks! – yishairasowsky Dec 16 '19 at 10:45
  • Ok so if I understand you correctly, you are already using the NER models in spaCy, and you want the rules to look at the surrounding sentence and extract begin/end clues ? – Sofie VL Dec 17 '19 at 07:51
  • I think using the parser could definitely help you. Actually, spaCy has a currently experimental and undocumented `DependencyMatcher` that could be useful to you. See also https://stackoverflow.com/questions/57664264/how-to-match-dependency-patterns-with-spacy and https://github.com/explosion/spaCy/issues/4433 – Sofie VL Dec 17 '19 at 07:56

You can use SUTime from CoreNLP to do it easily: https://github.com/FraBle/python-sutime

anas17in
  • because i do not know how to use that software. is it even in python, or just java? – yishairasowsky Dec 15 '19 at 14:39
  • This library is a Python wrapper on top of the original Java implementation. You can use it via Python. If you go through the link in my answer, you will find the installation instructions and sample code for it. – anas17in Dec 16 '19 at 05:10