
I'd like to extract data from a String, and this String sometimes appears in different ways. For example, it could be any of the following:

Portaria n° 200, 28 de janeiro de 2018.

Portaria n° 200, 28 de janeiro de 2018 da Republica Brasileira.

Portaria n° 200 28 de janeiro de 2018.

Portaria n° 200 2017/2018 de 28 de janeiro de 2018.

There is no fixed pattern. I have tried `split`: it works in some cases, but not in all of them.

    String receberTextoIdentifica = (xmlUtil.xpathElement(documentOrigem, Constantes.GETIDENTIFICACAO).getTextContent());
    LocalDateTime receberDataEnvio = materiaDto.getDataEnvio();
    Integer receberDataEnvioAno = receberDataEnvio.getYear();
    if (receberTextoIdentifica != null && receberTextoIdentifica.toLowerCase().contains("" + receberDataEnvioAno)) {
        Element dataTexto = documentDestino.createElement("dataTexto");
        estruturas.appendChild(dataTexto);
        // Cut the text right after the four-digit year, then keep the last five words
        receberTextoIdentifica = receberTextoIdentifica.substring(0, receberTextoIdentifica.indexOf("" + receberDataEnvioAno) + 4);
        String[] words = receberTextoIdentifica.split(" ");
        String lastFive = words[words.length - 5] + " " + words[words.length - 4] + " " + words[words.length - 3] + " "
                + words[words.length - 2] + " " + words[words.length - 1];
        dataTexto.setTextContent(lastFive);
    }
Ole V.V.
  • From these strings what data are you trying to extract ? Please provide an example. – CodeIt Dec 27 '18 at 13:17
  • I'd like to extract the date "28 de janeiro de 2018" @codelt – Philippe Sousa Dec 27 '18 at 13:19
  • Will "28 de janeiro de 2018" always be a fixed string, or will it change? – Rajas Dec 27 '18 at 13:20
  • Use `String str1 = "Portaria n° 200, 28 de janeiro de 2018"; String str1_array [] = str1.split(" ");`. The split function creates an array of words from the string. You can then write code to extract the required data from the string array. – CodeIt Dec 27 '18 at 13:23
  • @Rajas This pattern will always be the same; only the date will change, for example "24 de setembro de 2018" or "15 de novembro de 2017" – Philippe Sousa Dec 27 '18 at 13:24
  • @Codelt I'm using it, but it's not working for me because I receive the string in many different forms, so I have no pattern – Philippe Sousa Dec 27 '18 at 13:28
  • @PhilippeSousa Create another array of string with month names in Portuguese. Loop through each word in the resultant string array got using split function and match it with the month names. When you find a match take that index no. and create a string concatenating `str[index-2] + str[index-1] + str[index] + str[index+1] + str[index+2]`. This should probably help you. – CodeIt Dec 27 '18 at 13:32
  • You might want to look into this: https://stackoverflow.com/questions/13367066/date-extraction-from-text – Nikhil Dec 27 '18 at 13:32
  • You need to use a regular expression. This will give you an idea https://stackoverflow.com/questions/15491894/regex-to-validate-date-format-dd-mm-yyyy – Sergio Muriel Dec 27 '18 at 13:37

2 Answers

First use a regular expression to find the date in the string; then use a `DateTimeFormatter` to parse it into a `LocalDate`:

    Pattern datePattern = Pattern.compile("\\d{1,2} de [a-zç]{4,9} de \\d{4}");
    DateTimeFormatter portugueseDateFormatter
            = DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                    .withLocale(Locale.forLanguageTag("pt-BR"));

    String[] differentStrings = {
            "Portaria n° 200, 28 de janeiro de 2018.",
            "Portaria n° 200, 28 de janeiro de 2018 da Republica Brasileira.",
            "Portaria n° 200 28 de janeiro de 2018.",
            "Portaria n° 200 2017/2018 de 28 de janeiro de 2018."
    };

    for (String s : differentStrings) {
        Matcher m = datePattern.matcher(s);
        if (m.find()) {
            String dateString = m.group();
            LocalDate date = LocalDate.parse(dateString, portugueseDateFormatter);
            System.out.println("Date found: " + date);
        } else {
            System.out.println("No date found in " + s);
        }
    }

Output is:

Date found: 2018-01-28
Date found: 2018-01-28
Date found: 2018-01-28
Date found: 2018-01-28

The regular expression accepts one or two digits for the day of the month, then `de` (with a space before and after), four to nine lowercase letters of the month name, including ç as in março (March), then `de` again, and a four-digit year.

You will probably want to catch the `DateTimeParseException` that parsing can throw, and possibly even call `find()` again to see whether the real date comes later in the string.
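A minimal sketch of that retry idea, reusing the same regular expression and formatter as above. The class name `PortugueseDateFinder` and the method `findFirstValidDate` are hypothetical names chosen for illustration, and the regex spells ç as `\\u00e7` only to keep the source ASCII-safe:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.FormatStyle;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer
class PortugueseDateFinder {

    // Same pattern as above; \\u00e7 is the regex escape for the cedilla in "março"
    private static final Pattern DATE_PATTERN =
            Pattern.compile("\\d{1,2} de [a-z\\u00e7]{4,9} de \\d{4}");

    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                    .withLocale(Locale.forLanguageTag("pt-BR"));

    // Returns the first candidate that parses as a real date, or empty if none does
    public static Optional<LocalDate> findFirstValidDate(String text) {
        Matcher m = DATE_PATTERN.matcher(text);
        while (m.find()) {
            try {
                return Optional.of(LocalDate.parse(m.group(), FORMATTER));
            } catch (DateTimeParseException e) {
                // Looked like a date but was not one (e.g. day 99); try the next match
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Optional[2018-01-28]
        System.out.println(findFirstValidDate("Portaria 200 2017/2018 de 28 de janeiro de 2018."));
        // First candidate has an impossible day, so the second one wins: Optional[2018-02-02]
        System.out.println(findFirstValidDate("Emitida 99 de janeiro de 2018, publicada 2 de fevereiro de 2018."));
        // Optional.empty
        System.out.println(findFirstValidDate("Portaria 200"));
    }
}
```

Returning an `Optional` leaves it to the caller to decide what a missing date means; whether skipping unparsable candidates silently is acceptable depends on your data.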

Ole V.V.
  • If you could reopen this, I will be able to post an alternate solution. https://repl.it/repls/SplendidEthicalObjects – CodeIt Dec 27 '18 at 13:52
  • @CodeIt Please do. This one was on the border of a very exact duplicate, so I figure your solution makes more sense here than as an answer to the original I had linked to. I know that answering and closing is bad style anyway; I only discovered the original after having posted my answer. Sorry. – Ole V.V. Dec 27 '18 at 13:55
  • Thanks! Posted [here](https://stackoverflow.com/a/53946269/3091398). – CodeIt Dec 27 '18 at 14:12
  • Thx a lot, it helped me! I did something very similar... `Matcher m = Pattern.compile("([0-9]{1,2}\\s+d\\s?e?\\s+\\&?\\&?[a-zà-ü]{4,9}\\s+de\\s+[0-9]{4}\\,?)", Pattern.CASE_INSENSITIVE) .matcher(receberTextoIdentifica);` – Philippe Sousa Dec 27 '18 at 15:07
  • I need to include "28 de janeiro de 2018" in my XML tag, with the month name, so I can't use the parsed date in that format. Thx a lot!! – Philippe Sousa Dec 27 '18 at 15:09
  • Just curious, @PhilippeSousa, what’s `\\&?\\&?`? Thx for reporting back. – Ole V.V. Dec 27 '18 at 15:29

An alternative to the approach suggested by @Ole.

This method takes the date text from the string as it is, without converting it into a date object.

Code:

    import java.util.Arrays;
    import java.util.List;

    class Main {

      public static void main(String[] args) {

        String[] strs = {
            "Portaria n° 200, 28 de janeiro de 2018",
            "Portaria n° 200, 28 de janeiro de 2018 da Republica Brasileira",
            "Portaria n° 200 28 de janeiro de 2018",
            "Portaria n° 200 2017/2018 de 25 de janeiro de 2018"
        };

        // Portuguese month names; "março" needs its cedilla to match real input
        List<String> months = Arrays.asList("janeiro", "fevereiro", "março", "abril", "maio", "junho",
            "julho", "agosto", "setembro", "outubro", "novembro", "dezembro");

        for (String str : strs) {
          String[] words = str.split(" ");

          // Start at 2 and stop 2 short of the end so that j-2 and j+2 stay in bounds
          for (int j = 2; j < words.length - 2; j++) {
            if (months.contains(words[j])) {
              // Day, "de", month, "de", year: two words on either side of the month
              System.out.println(words[j - 2] + " " + words[j - 1] + " " + words[j] + " "
                  + words[j + 1] + " " + words[j + 2]);
            }
          }
        }
      }
    }

Output:

28 de janeiro de 2018
28 de janeiro de 2018
28 de janeiro de 2018
25 de janeiro de 2018

See this in action here.
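The strings in the question end with a period or are followed by more text, so the year word may carry trailing punctuation such as `2018.`. A hedged sketch of the same word-scanning idea that trims it; the class `DatePhraseScanner` and the method `extractDatePhrase` are hypothetical names, and the choice of punctuation to strip is an assumption about the input:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper class, not part of the original answer
class DatePhraseScanner {

    // Portuguese month names; \u00e7 is the cedilla in "março", escaped to keep the source ASCII-safe
    private static final Set<String> MONTHS = new HashSet<>(Arrays.asList(
            "janeiro", "fevereiro", "mar\u00e7o", "abril", "maio", "junho",
            "julho", "agosto", "setembro", "outubro", "novembro", "dezembro"));

    // Returns "<day> de <month> de <year>" around the first month name found, if any
    public static Optional<String> extractDatePhrase(String text) {
        String[] words = text.split("\\s+");
        // Bounds keep words[j - 2] and words[j + 2] inside the array
        for (int j = 2; j < words.length - 2; j++) {
            if (MONTHS.contains(words[j].toLowerCase(Locale.ROOT))) {
                // Assumption: only trailing . , ; : need to be removed from the year
                String year = words[j + 2].replaceAll("[.,;:]+$", "");
                return Optional.of(String.join(" ",
                        words[j - 2], words[j - 1], words[j], words[j + 1], year));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Optional[28 de janeiro de 2018]
        System.out.println(extractDatePhrase("Portaria 200, 28 de janeiro de 2018."));
        // Optional[28 de janeiro de 2018]
        System.out.println(extractDatePhrase(
                "Portaria 200 2017/2018 de 28 de janeiro de 2018 da Republica Brasileira."));
        // Optional.empty
        System.out.println(extractDatePhrase("Portaria 200"));
    }
}
```

This sketch does not check that the word before the month is actually a day number, so a month name used in running text would still produce a result.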

CodeIt
  • Thx a lot!! I'm gonna try it !! – Philippe Sousa Dec 27 '18 at 15:08
  • @OleV.V. Thank you for your encouragement. – CodeIt Dec 28 '18 at 08:52
  • I’m a bit in doubt about the upvote. It’s certainly a perfectly valid answer. I still think that it’s best to put dates into `LocalDate` objects rather than strings (except for some “run once and throw away” kind of programs). In any case it’s good to have different solutions to consider and to choose from. An advantage is you’re more following the OP’s way of doing it. – Ole V.V. Dec 28 '18 at 08:55
  • @OleV.V. You are absolutely right, but the OP has accepted it. In fact the OP tried a similar method, `String lastFive = words[words.length - 5] + " " + words[words.length - 4] + " " + words[words.length - 3] + " " + words[words.length - 2] + " " + words[words.length - 1];`, but was not able to figure out the pattern properly. I don't have much experience in writing Java code, and I never thought of any other way to solve this problem. – CodeIt Dec 28 '18 at 08:58