
Background: [Skip ahead to next section for exact problem]

I am currently working on Hadoop as a small project at my university (not a mandatory project; I am doing it because I want to).

My plan was to use 5 PCs in one of the labs (Master + 4 Slaves) to run a KNN algorithm on a large data set to find out the running time, etc.

I knew I could find the basic code on the internet, and I did find it (https://github.com/matt-hicks/MapReduce-KNN). It runs fine for a single test case, but what I have is a very large data set with hundreds of test cases. Therefore, I needed to run the same bit of code once for each test case.

The Problem:

tl;dr: I have a KNN program that only takes one test case at a time, but I want to make it iterative so that it can work with multiple test cases.

My Solution:

I am not very experienced with this, and from the basics I know, I decided to turn the variables and maps into arrays of variables and arrays of maps.

So this:

    public static class KnnMapper extends Mapper<Object, Text, NullWritable, DoubleString>
    {
        DoubleString distanceAndModel = new DoubleString();
        TreeMap<Double, String> KnnMap = new TreeMap<Double, String>();

        // Declaring some variables which will be used throughout the mapper
        int K;

        double normalisedSAge;
        double normalisedSIncome;
        String sStatus;
        String sGender;
        double normalisedSChildren;

became this:

    DoubleString distanceAndModel = new DoubleString();
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];

    // Declaring some variables which will be used throughout the mapper
    int[] K = new int[1000];

    double[] normalisedSAge = new double[1000];
    double[] normalisedSIncome = new double[1000];
    String[] sStatus = new String[1000];
    String[] sGender = new String[1000];
    double[] normalisedSChildren = new double[1000];
    int n = 0;

And this:

    protected void setup(Context context) throws IOException, InterruptedException
    {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
        {
            // Read parameter file using alias established in main()
            String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
            StringTokenizer st = new StringTokenizer(knnParams, ",");

            // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
            // These values will remain unchanged throughout the mapper
            K = Integer.parseInt(st.nextToken());
            normalisedSAge = normalisedDouble(st.nextToken(), minAge, maxAge);
            normalisedSIncome = normalisedDouble(st.nextToken(), minIncome, maxIncome);
            sStatus = st.nextToken();
            sGender = st.nextToken();
            normalisedSChildren = normalisedDouble(st.nextToken(), minChildren, maxChildren);
        }
    }

became this:

    protected void setup(Context context) throws IOException, InterruptedException
    {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
        {
            // Read parameter file using alias established in main()
            String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
            // Splitting the input file on newline or carriage return characters
            // (i.e., Windows-style line endings)
            StringTokenizer lineSt = new StringTokenizer(knnParams, "\n\r");

            // Loop to tokenize each line of input, i.e., each test case
            while (lineSt.hasMoreTokens()) {
                String nextLine = lineSt.nextToken(); // Current line as a string
                StringTokenizer st = new StringTokenizer(nextLine, ","); // Tokenizing the current line into single values

                // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
                // These values will remain unchanged throughout the mapper
                K[n] = Integer.parseInt(st.nextToken());
                normalisedSAge[n] = normalisedDouble(st.nextToken(), minAge, maxAge);
                normalisedSIncome[n] = normalisedDouble(st.nextToken(), minIncome, maxIncome);
                sStatus[n] = st.nextToken();
                sGender[n] = st.nextToken();
                normalisedSChildren[n] = normalisedDouble(st.nextToken(), minChildren, maxChildren);
                n++;
            }
        }
    }

And so on for the reducer class as well.

This is the first time I have worked with TreeMaps, though. I have studied and used trees before, but not Maps or TreeMaps. I still tried to make it an array, which turned out to be wrong:

    /home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation
            TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                               ^

    /home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double
                            normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);
                                                 ^

    /home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation
            TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                               ^

    /home/hduser/Desktop/knn/KnnPattern.java:283: error: bad operand types for binary operator '>'
                if (KnnMap[num].size() > K)
                                       ^
      first type:  int
      second type: int[]

Now, I thought that maybe if I used a Linked List of TreeMaps instead, it could work.

But I have mostly worked with C/C++ and Python at uni so far. OOP here seems to make life easier for people, but I am not 100% sure how to use it.

My question:

Is it possible to make a Linked List of TreeMaps?

Is there a Linked List substitute for:

TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];

And is my approach to the problem correct? Making the code iterative should let it work through all the test cases, right?

I will try to make it work from there by trial and error, but this is something I have been stuck on for a few days now.

My apologies if someone has already asked this before, but I couldn't find anything, so I had to write a question. Please share the link to any related answer if you think this has already been answered.

Thank you! And, on a side note: is there anything else I should keep in mind when working with TreeMaps, and specifically a linked list of TreeMaps?

Vinit
  • https://stackoverflow.com/help/mcve – Jack Flamp Aug 28 '18 at 08:01
  • `List<TreeMap<Double, String>>`? – akop Aug 28 '18 at 08:05
  • hello @JackFlamp, is there any way I can make this question better? I read the link you shared, and I have tried to minimize the code to the two important parts and have added various sections so that my ACTUAL question is separate and mostly standalone. The rest is mostly context that I felt might be required, as the answer may vary depending on the usage of the TreeMap. I appreciate the quick response though. Thank you. :) – Vinit Aug 28 '18 at 08:05

1 Answer


Regarding the error messages

/home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                       ^

and

/home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                       ^

These errors occur because you tried to create an array with a generic component type, which Java does not support because generic type information is erased at runtime. A workaround (if you really need an array) would be to create a List of TreeMap objects and then convert it to an array:

// TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
List<TreeMap<Double, String>> KnnMapList = new LinkedList<>();
// ... fill the list with one TreeMap per test case ...
@SuppressWarnings("unchecked")
TreeMap<Double, String>[] KnnMap = KnnMapList.toArray(new TreeMap[0]);

(Note that the no-argument toArray() returns an Object[], which cannot be cast to TreeMap[] at runtime, so the typed overload is needed here.)

See this question for further information.
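Alternatively, since the number of test cases is only known after the parameter file has been parsed, you could skip the array conversion entirely and work with the List directly. A minimal sketch, reusing the field names from the question (`num` stands for whatever index the surrounding code uses; none of this is from the original repository):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // One TreeMap per test case; the list grows as lines are parsed,
    // so no fixed size (like 1000) has to be chosen up front.
    List<TreeMap<Double, String>> knnMaps = new ArrayList<>();

    // in setup(), once per parsed test case:
    knnMaps.add(new TreeMap<Double, String>());

    // later, instead of indexing an array (KnnMap[num]):
    TreeMap<Double, String> mapForTestCase = knnMaps.get(num);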


/home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double
                normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);
                                     ^

By looking at the source code on GitHub, I realized that you probably did not modify the following method call in method KnnMapper#map(Object, Text, Context):

double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                    normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);

should be

double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                    normalisedRChildren, normalisedSAge[n], normalisedSIncome[n], sStatus[n], sGender[n], normalisedSChildren[n]);

But I guess that these modifications will not give you the desired behavior, because KnnMapper#map(Object, Text, Context) is only called once per key/value pair (as stated here), and you probably want to call it n times.
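For illustration only, calling it n times per input record would mean looping over the parsed test cases inside map(). A rough, hypothetical sketch based on the question's arrays (it assumes each KnnMap[i] was initialized in setup(), and `rModel` stands for whatever value the original map() stores per record):

    // Hypothetical sketch (untested): compare the current record against
    // every parsed test case inside a single call to map().
    for (int i = 0; i < n; i++) {
        double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                normalisedRChildren, normalisedSAge[i], normalisedSIncome[i], sStatus[i],
                sGender[i], normalisedSChildren[i]);
        KnnMap[i].put(tDist, rModel);              // one neighbour map per test case
        if (KnnMap[i].size() > K[i]) {
            KnnMap[i].remove(KnnMap[i].lastKey()); // keep only the K nearest
        }
    }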


Concrete problem

To avoid further trouble, I suggest that you leave the upper code of the GitHub class untouched and only modify the KnnPattern#main(String[]) method so that it calls the job n times, as described in this answer.


Edit: Example

This is a modified KnnPattern#main(String[]) method that reads your data file line by line, creates a temporary file with the current line as its content, and starts a job with the temporary file as the cache file.
(Assuming that you are using at least Java 7)

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
...
public class KnnPattern
{
  ...
    public static void main(String[] args) throws Exception {
        // Create configuration
        Configuration conf = new Configuration();

        if (args.length != 3) {
            System.err.println("Usage: KnnPattern <in> <out> <parameter file>");
            System.exit(2);
        }

        try (final BufferedReader br = new BufferedReader(new FileReader(args[2]))) {
            int n = 1;
            String line;
            while ((line = br.readLine()) != null) {
                // create temporary file with content of current line
                final File tmpDataFile = File.createTempFile("hadoop-test-", null);
                try (BufferedWriter tmpDataWriter = new BufferedWriter(new FileWriter(tmpDataFile))) {
                    tmpDataWriter.write(line);
                    tmpDataWriter.flush();
                }

                // Create job
                Job job = Job.getInstance(conf, "Find K-Nearest Neighbour #" + n);
                job.setJarByClass(KnnPattern.class);
                // Set the third parameter when running the job to be the parameter file and give it an alias
                job.addCacheFile(new URI(tmpDataFile.getAbsolutePath() + "#knnParamFile")); // Parameter file containing test data

                // Setup MapReduce job
                job.setMapperClass(KnnMapper.class);
                job.setReducerClass(KnnReducer.class);
                job.setNumReduceTasks(1); // Only one reducer in this design

                // Specify key / value
                job.setMapOutputKeyClass(NullWritable.class);
                job.setMapOutputValueClass(DoubleString.class);
                job.setOutputKeyClass(NullWritable.class);
                job.setOutputValueClass(Text.class);

                // Input (the data file) and Output (the resulting classification)
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1] + "_" + n));

                // Execute job
                final boolean jobSucceeded = job.waitForCompletion(true);

                // clean up
                tmpDataFile.delete();

                if (!jobSucceeded) {
                    // return error status if job failed
                    System.exit(1);
                }

                ++n;
            }
        }
    }

}
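Note that each job writes its result to its own output directory (args[1] suffixed with "_" and the job number), so the classifications of the individual test cases do not overwrite each other, and each temporary parameter file is deleted once its job has completed.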
ltlBeBoy
  • It takes the data from a txt file and then breaks it into tokens in the Mapper. This is the first time I am actually working with an unconventional method of input, i.e., a text file (the usual being the normal console for, say, Eclipse or Code::Blocks). This is why I thought it better to leave the main function alone and do what I know a little bit about. I am breaking the input from the txt file into lines and then using the method the original author used to break each line into tokens for one set of the input. If that makes sense. – Vinit Aug 28 '18 at 14:55
  • Also, I really appreciate the answer. Thank you for the quick response. :) – Vinit Aug 28 '18 at 14:57
  • @vinit144 Added an example which creates a new job for each line of your data file. – ltlBeBoy Aug 28 '18 at 17:06
  • That really helps a lot. I will continue working on this for my actual data set. Thank you once again. Really appreciate the help :) – Vinit Aug 28 '18 at 20:07
  • It almost worked fine today (I finally was able to get a free day and tested it in my lab), but on doing: hadoop jar KNN02.jar KnnPattern /knn/Input/CarOwners.csv /knn/Output /knn/Params/params.txt I got: Exception in thread "main" java.io.FileNotFoundException: /knn/Params/params.txt (No such file or directory) I triple checked, the params file is in hdfs and it worked fine when I tried it on the original code. I just can't understand how it can't find the file. Sorry for bothering you again :) – Vinit Sep 01 '18 at 13:34