-2

I have a text file containing approximately 30,000 Hindi words. I would like to extract the unique words from the file and save them in tabular form. I am trying to do this in Java, but I am not sure how to achieve it. Any help is highly appreciated.

  • Is there a specific reason that those 30'000 words are not in a database? How do you define the unique word that you need to find? With a regular expression? – Gildraths Jun 07 '16 at 08:06
  • Unique in the sense that repeated words are not included. Only one occurrence of each word should be taken and saved in the table. The reason is that I have to differentiate the stop words, root words and inflected words in the file. – Suman Chaudhary Jun 07 '16 at 09:15
  • So basically out of the 30'000 words there are, for example, 15'000 words (no double entries) that you want to save in tabular form, correct? What about that tabular form: is it displayed in a GUI, saved in an Excel sheet, or what is the thought behind it? – Gildraths Jun 07 '16 at 09:19
  • Also is each new word on a new line or how are they separated? – Gildraths Jun 07 '16 at 10:17
  • For example, the following passage contains 38 words, and the unique words are the total without repetition, i.e. 30: हालाँकि सूर के जीवन के बारे में कई जनश्रुतियाँ प्रचलित हैं, पर इन में कितनी सच्चाई है यह कहना कठिन है। कहा जाता है उनका जन्म सन् १४७८ में दिल्ली के पास एक ग़रीब ब्राह्मीण परिवार में हुआ। The expected output is 1.हालाँकि 2.सूर 3.के 4.जीवन 5.बारे 6.में 7.कई 8.जनश्रुतियाँ 9.प्रचलित 10.हैं 11.पर 12.इन 13.कितनी 14.सच्चाई 15.यह 16.कहना 17.कठिन 18.कहा 19.जाता 20.उनका 21.जन्म and so on. If a word is already saved in the file, it should be ignored and not considered again. – Suman Chaudhary Jun 07 '16 at 10:50
  • My concern is that the output should be in tabular form, either in Excel or in Word, and it should be editable. There are many separators in Hindi, but we only have to store each new word that is not yet in the output table, i.e. the first time that word is processed. Separators and new lines don't matter. The reason for the tabular form is that I have to remove stop words and inflected words from the list for further processing. – Suman Chaudhary Jun 07 '16 at 10:56

1 Answer

0

I would suggest using a Set (http://docs.oracle.com/javase/6/docs/api/java/util/Set.html) to store your strings.

The advantage is that a Set never stores the same value more than once, so adding a word that is already present has no effect. Here is an example:

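import java.util.HashSet;
import java.util.Set;

Set<String> storage = new HashSet<String>(); // use TreeSet<String> if you need the values sorted
storage.add("dog");
storage.add("cat");
storage.add("cat"); // duplicate, ignored by the Set

for (String name : storage) {
  System.out.println(name); // prints dog and cat once each; HashSet does not guarantee the order
}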

To read the file, see the question "Reading a plain text file in Java".

For the tabular output, you can write the unique words to a plain text file with "," between the values and save it as a .csv file. You can then import that file into Excel and edit it there.
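Putting the pieces together, here is a minimal sketch of how reading the file, filtering duplicates, and writing the CSV could look. The class name UniqueWords, the method uniqueWords, and the two command-line arguments are made up for illustration. It also assumes the input file is UTF-8 encoded and that a "word" is any run of Unicode letters and combining marks (so Hindi matras stay attached, while punctuation such as "," and "।", spaces, and digits act as separators). Adjust the regular expression if your definition of a word differs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class UniqueWords {

    // Returns the distinct words of the text in order of first appearance.
    // Assumption: a word is a run of letters (\p{L}) and combining marks (\p{M});
    // everything else (spaces, punctuation, digits) is treated as a separator.
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>(); // keeps insertion order, drops duplicates
        for (String token : text.split("[^\\p{L}\\p{M}]+")) {
            if (!token.isEmpty()) { // split() yields an empty first token if the text starts with a separator
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java UniqueWords <input.txt> <output.csv>");
            return;
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);

        // Read the whole file as UTF-8 (fine for roughly 30,000 words).
        String text = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);

        // One word per line gives a single-column CSV.
        Files.write(output, uniqueWords(text), StandardCharsets.UTF_8);
    }
}
```

I used a LinkedHashSet here so that the words come out in the order they first appear, which matches the numbered list in your comment. Since the words contain only letters and marks, no CSV quoting is needed. If Excel shows the Hindi text garbled when you open the file directly, import it through Excel's text import and pick UTF-8 as the encoding.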

Gildraths