I am trying to index each word in a text file using Java.

By "index" I mean a list of every word together with the locations (poem number and line number) where it occurs.

This is my sample file https://pastebin.com/hxB8t56p (the actual file I want to index is much larger)

This is the code I have tried so far:

ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);

String[] split = content.split("\\s"); // split the file content on whitespace
for(String b:split) {
    ar.add(b); // ar now contains every whitespace-separated token of the poem
}
FileInputStream fstream = new FileInputStream("D:\\folder\\poemt.txt");
String answer = "";
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;

while((strLine=br.readLine())!=null) {
    String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
    String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
    if (nums.matches(".*[0-9].*")) {
        songnum = Integer.parseInt(nums); // Parse string to int
    }
    String regex = ".*\\d+.*";
    boolean result = strLine.matches(regex);
    if (result == true) { // check if strLine contain digit
        count = 1;
    }
    answer = songnum + "." + count + "(" + text + ")";
    count++;
    sen.add(answer); // added songnum + line number and text to sen
}

for(int i = 0;i<sen.size();i++) { // loop to match and get word+poem number+line number
    for (int j = 0; j < ar.size(); j++) {
        if (sen.get(i).contains(ar.get(j))) {
            if (!ar.get(j).isEmpty()) {
                String x = ar.get(j) + " - " + sen.get(i);
                x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
                String[] sp = x.split("\\s+");
                word.add(sp[0]); // each word in the poem is added to the word arraylist
                fin.add(x); // word+poem number+line number
            }
        }
    }
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();
fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta"); // note: "ta" is the language code for Tamil, despite the variable name
Collator lithuanianCollator = Collator.getInstance(lithuanian);
Collections.sort(fin, lithuanianCollator); // sort the list using the collator
System.out.println(fin);


    (change in blossom. - 0.2,1.2, &  the - 0.1,1.2, & then - 0.1,1.2)
hesh
  • It is difficult to read your code. Please reformat it. The test case really helps, though. – PalLaden Jun 23 '20 at 07:29
  • This question will likely need more upvotes to gather enough attention. – PalLaden Jun 23 '20 at 07:32
  • Please elaborate on the word **index**. The test case shows that the words "And" , "more" are not considered. Why? – PalLaden Jun 23 '20 at 07:38
  • Does this answer your question? [Reading a plain text file in Java](https://stackoverflow.com/questions/4716503/reading-a-plain-text-file-in-java) – Eldar B. Jun 23 '20 at 07:51
  • @Pal Laden Actually, there was no space between "0." and "And", and I split words on spaces. My original file has the space; that was my mistake, I forgot to leave a space there :( – hesh Jun 23 '20 at 08:02
  • @Eldar I don't want to just read a text file; I want to create an index for each word in the text file. – hesh Jun 23 '20 at 08:05
  • @Pal Laden I have corrected my pastebin, thank you. – hesh Jun 23 '20 at 08:08
  • Well, then you can just read a file and use `.split(" ")` to turn it into an array. Then, instead of using (for unknown reason) four entire ArrayLists and an array, you can actually write efficient code. – Eldar B. Jun 23 '20 at 08:08
  • Yes, I used "\\s+" to split here. Please see my pastebin; I have attached my output there. – hesh Jun 23 '20 at 08:10
  • I believe `split("\\b")` is more appropriate for splitting the string into words. For example it will give you the word `blossom` and not `blossom.` (with the trailing period) – Abra Jun 23 '20 at 08:18
  • @Abra Thank you for that. I will replace it in my code. – hesh Jun 23 '20 at 08:22
  • @Abra. note that the "expected output" includes trailing punctuation – tucuxi Jun 23 '20 at 09:31
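
The two tokenization choices discussed in these comments can be compared with a small sketch. The class and method names (`TokenizeSketch`, `byWhitespace`, `byWordChars`) are hypothetical and not part of the question's code. Note that `split("\\b")` also returns the whitespace and punctuation between words as separate elements, so the second method collects runs of letters, marks and digits instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeSketch {
    // Hypothetical helper: split on runs of whitespace.
    // Punctuation stays attached to the word ("blossom.").
    static List<String> byWhitespace(String line) {
        List<String> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) out.add(token); // a blank line yields one empty token
        }
        return out;
    }

    // Hypothetical helper: collect runs of letters, combining marks and digits.
    // Punctuation is dropped ("blossom").
    static List<String> byWordChars(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{M}\\p{N}]+").matcher(line);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        String line = "  to remain blossom.";
        System.out.println(byWhitespace(line)); // [to, remain, blossom.]
        System.out.println(byWordChars(line));  // [to, remain, blossom]
    }
}
```

Which one is appropriate depends on whether the index should keep trailing punctuation, as the expected output in the question does.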

1 Answer

I will first copy the intended output for your pasted example, and then go over the code to find how to change it:

Poem.txt

0.And then the day came,
  to remain blossom.
1.more painful
  then the blossom.

Expected output

[blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2]

As @Pal Laden notes in the comments, some words ("and", "more") are not being indexed. Most likely they are being treated as stopwords and deliberately left out of the index.

The current output of your code is

[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]

So, assuming you fix your stopwords, you are actually quite close. Your fin list contains one entry per word + poem number + line number, but it should contain one entry per word, followed by the *list* of all poem number + line number locations for that word. There are several ways to fix this. First, we need to remove stopwords:

// build stopword-removal set "toIgnore" (once, before the loops)
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>(Arrays.asList(stopWords));

// inside the inner loop: only keep non-ignored words
if ( ! toIgnore.contains(sp[0])) fin.add(x);
// was: fin.add(x);

Now, let's fix the list problem. The easiest (but ugly) way is to post-process "fin" at the very end, after it has been sorted, so that entries for the same word are next to each other:

List<String> fixed = new ArrayList<>();
String prevWord = "";
String prevLocs = "";
for (String s : fin) {
    String[] parts = s.split(" - ");
    if (parts[0].equals(prevWord)) {
        prevLocs += "," + parts[1];
    } else {
        if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
        prevWord = parts[0];
        prevLocs = parts[1];
    }
}
// flush the last word, which the loop above never adds
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);

System.out.println(fixed);
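
To check the merge step in isolation, here is a minimal self-contained sketch. It is a variant of the loop above rather than the same code: it groups entries with a `LinkedHashMap`, so it does not rely on equal words being adjacent. The class name `MergeSketch` and the method `merge` are hypothetical, and the input is simply the "current output" list shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // Hypothetical helper: turn "word - loc" entries into "word - loc1,loc2" entries.
    static List<String> merge(List<String> entries) {
        // LinkedHashMap keeps words in order of first appearance,
        // so an already-sorted input stays sorted.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String entry : entries) {
            String[] parts = entry.split(" - ", 2);
            if (parts.length < 2) continue; // skip entries without a location
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            merged.add(e.getKey() + " - " + String.join(",", e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> fin = Arrays.asList(
                "blossom. - 0.2", "blossom. - 1.2", "came, - 0.1", "day - 0.1",
                "painful - 1.1", "remain - 0.2", "the - 0.1", "the - 1.2",
                "then - 0.1", "then - 1.2", "to - 0.2");
        // prints [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1,
        //         remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2] on one line
        System.out.println(merge(fin));
    }
}
```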

How to do it the right way (TM)

Your code can be much improved. In particular, using flat ArrayLists for everything is not always the best idea. Maps are great for building indices, because each word can map directly to the list of its locations:

// build stopwords
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

// prepare always-sorted, quick-lookup map of terms
// (note: "ta" is the language code for Tamil, despite the variable name)
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>(lithuanianCollator);

// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line: content.split("[\n\r]+")) {
    line = line.toLowerCase().trim(); // remove spaces on both sides

    // update locations
    Matcher m = countPattern.matcher(line);
    if (m.matches()) {
        poemCount = Integer.parseInt(m.group(1));
        lineCount = 1;
        line = m.group(2); // ignore number for word-finding purposes
    } else {
        lineCount ++;
    }

    // read words in line, with locations already taken care of
    for (String word: line.trim().split("\\s+")) {
        if (word.isEmpty() || toIgnore.contains(word)) continue;
        // create the location list the first time a word is seen, then append
        terms.computeIfAbsent(word, k -> new ArrayList<>()).add(poemCount + "." + lineCount);
    }
}

// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e: terms.entrySet()) {
    output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);

With the stopword list above, this prints [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, then - 0.1,1.2]. I have not tuned the list of stopwords to match your expected output exactly, but that should be easy to do.
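
For completeness, here is a self-contained sketch of the same map-based approach that can be compiled and run on its own. The names `WordIndexSketch` and `buildIndex` are hypothetical, the lines are passed in as a list instead of being read from a file, and the stopword set is assumed to be just "and" and "more", which is what the expected output above suggests.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordIndexSketch {
    // A line that starts a new poem looks like "12.some text"
    private static final Pattern POEM_START = Pattern.compile("([0-9]+)\\.(.*)");

    // Hypothetical helper: returns entries of the form "word - poem.line,poem.line"
    static List<String> buildIndex(List<String> lines, Set<String> stopWords) {
        // TreeMap keeps the words sorted by the collator ("ta" is Tamil)
        Map<String, List<String>> terms =
                new TreeMap<>(Collator.getInstance(Locale.forLanguageTag("ta")));
        int poem = 0;
        int lineNo = 0;
        for (String raw : lines) {
            String line = raw.trim().toLowerCase(Locale.ROOT);
            if (line.isEmpty()) continue; // blank lines are not counted
            Matcher m = POEM_START.matcher(line);
            if (m.matches()) { // new poem: restart line numbering
                poem = Integer.parseInt(m.group(1));
                lineNo = 1;
                line = m.group(2).trim();
            } else {
                lineNo++;
            }
            String location = poem + "." + lineNo;
            for (String word : line.split("\\s+")) {
                if (word.isEmpty() || stopWords.contains(word)) continue;
                List<String> locs = terms.computeIfAbsent(word, k -> new ArrayList<>());
                // do not record the same line twice for a repeated word
                if (locs.isEmpty() || !locs.get(locs.size() - 1).equals(location)) {
                    locs.add(location);
                }
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : terms.entrySet()) {
            output.add(e.getKey() + " - " + String.join(",", e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> poem = Arrays.asList(
                "0.And then the day came,", "  to remain blossom.",
                "1.more painful", "  then the blossom.");
        Set<String> stopWords = new HashSet<>(Arrays.asList("and", "more")); // assumed list
        System.out.println(buildIndex(poem, stopWords));
    }
}
```

For a real file, the lines could be obtained with `Files.readAllLines(Paths.get(...), StandardCharsets.UTF_8)` and passed to `buildIndex` unchanged.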

tucuxi