
I have two or more log files that will be merged into a new file.

The log file format is something like this:

Dir1 File1Path1 File1Path2 Timestamp tempfileName
Dir1 File2Path1 File2Path2 Timestamp tempfileName
Dir2 File1Path1 File1Path2 Timestamp tempfileName

Dir3 File1Path1 File1Path2 Timestamp tempfileName
Dir3 File2Path1 File2Path2 Timestamp tempfileName
Dir3 File1Path1 File1Path2 Timestamp tempfileName
Dir4 File1Path1 File1Path2 Timestamp tempfileName

etc.

My requirements are as follows:

  1. Check that the format of each line in each log file is correct, i.e. all values are recorded
  2. Check that there are no duplicates
  3. Verify that the files were merged properly, i.e. all log lines from each log file have been merged into the new log file
  4. Compare the new merged file to a baseline file

I have already written code for 1: I read the file and load the contents into a dataset, by row/column.

        // One column per field in a log line: Dir, Path1, Path2, Timestamp, TempFileName.
        data.Tables[tableName].Columns.Add("Dir");
        data.Tables[tableName].Columns.Add("Path1");
        data.Tables[tableName].Columns.Add("Path2");
        data.Tables[tableName].Columns.Add("Timestamp");
        data.Tables[tableName].Columns.Add("TempFileName");

        using (StreamReader reader = new StreamReader(log))
        {
            string line = string.Empty;
            while ((line = reader.ReadLine()) != null)
            {
                // Split each line on tabs, one value per column.
                data.Tables[tableName].Rows.Add(line.Split(new string[] { "\t" }, data.Tables[tableName].Columns.Count, StringSplitOptions.RemoveEmptyEntries));
            }
        }

But I am not sure whether loading the lines into a dataset is the right approach for the rest of the tasks. What would be the fastest and best approach? I could loop over each row value and compare it to the rest (see the sketch below), but I don't think that will be faster. The log files can be between 20 and 45 MB.
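
For requirement 2, the brute-force check I have in mind would be something like this sketch, comparing every row in the dataset to every later row (it needs `using System.Data;` and `using System.Linq;`):

    // Naive O(n^2) duplicate check: compare every DataRow to every later one.
    DataTable table = data.Tables[tableName];
    for (int i = 0; i < table.Rows.Count; i++)
    {
        for (int j = i + 1; j < table.Rows.Count; j++)
        {
            if (table.Rows[i].ItemArray.SequenceEqual(table.Rows[j].ItemArray))
            {
                Console.WriteLine("Rows {0} and {1} are duplicates.", i, j);
            }
        }
    }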

The merged log contents should look like this (the lines can be in any order):

Dir1 File1Path1 File1Path2 Timestamp tempfileName
Dir1 File2Path1 File2Path2 Timestamp tempfileName
Dir2 File1Path1 File1Path2 Timestamp tempfileName
Dir4 File1Path1 File1Path2 Timestamp tempfileName
Dir3 File1Path1 File1Path2 Timestamp tempfileName
Dir3 File2Path1 File2Path2 Timestamp tempfileName
Dir3 File1Path1 File1Path2 Timestamp tempfileName

Thanks for looking.

user393148

1 Answer


If you can load all of the data into memory at once, then checking duplicates is easy: just load the data and let LINQ remove the duplicates. That is:

List<string> lines = LoadEverything();
foreach (var line in lines.Distinct()) // might want to supply an equality comparer
{
    // write the line to the output file
}

If you can't load all of the files in memory at once, then load each one, sort it, and output the sorted list to a new file. Then do an n-way merge on the sorted files to remove duplicates.
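
Something along these lines would do the n-way merge with deduplication, assuming each input file has already been sorted and written back out (the method name and file paths are placeholders for this sketch; it needs System, System.IO, System.Linq, and System.Collections.Generic):

// Sketch: merge already-sorted files into one output, writing each distinct line once.
// Because every input is sorted, duplicate lines always show up consecutively in the merge.
static void MergeSortedFiles(string[] sortedInputs, string outputPath)
{
    var readers = sortedInputs.Select(f => new StreamReader(f)).ToList();
    try
    {
        // The current "front" line of each file; null means that file is exhausted.
        var current = readers.Select(r => r.ReadLine()).ToList();
        string lastWritten = null;

        using (var writer = new StreamWriter(outputPath))
        {
            while (current.Any(l => l != null))
            {
                // Find the smallest front line among the files that still have data.
                int min = -1;
                for (int i = 0; i < current.Count; i++)
                {
                    if (current[i] == null) continue;
                    if (min == -1 || string.CompareOrdinal(current[i], current[min]) < 0)
                        min = i;
                }

                // Skip it if it's the same as the line we just wrote (a duplicate).
                if (current[min] != lastWritten)
                {
                    writer.WriteLine(current[min]);
                    lastWritten = current[min];
                }

                // Advance the file we just consumed from.
                current[min] = readers[min].ReadLine();
            }
        }
    }
    finally
    {
        foreach (var r in readers) r.Dispose();
    }
}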

Either of these is going to be a whole lot faster than using List.Contains() on a list of any significant size.
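
The reason is that a hash-based lookup is effectively constant time, while List.Contains() rescans the whole list for every line. If you prefer a streaming version of the in-memory approach, a HashSet does the same job in a single pass (sketch; inputFiles stands in for your collection of log file paths and the output name is just an example):

// One pass over all input lines; HashSet.Add returns false for a line we've already seen,
// so duplicates are skipped without rescanning anything.
var seen = new HashSet<string>();
using (var writer = new StreamWriter("merged.log"))
{
    foreach (var line in inputFiles.SelectMany(path => File.ReadLines(path)))
    {
        if (seen.Add(line))
            writer.WriteLine(line);
    }
}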

You didn't say whether you want to remove duplicates from each individual file, or if you want to remove duplicates from the combined file. Removing duplicates from the individual files is easy: just load each file into memory, do a Distinct on it, and then write it to the output. The discussion above assumes that you want to remove duplicates from the combined file, which is a little harder if you can't load everything into memory at once.
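
Per-file deduplication in that style is just a couple of lines (paths here are placeholders):

// Load one file, drop duplicate lines, and write the result to a new file.
var distinctLines = File.ReadAllLines(inputPath).Distinct();
File.WriteAllLines(outputPath, distinctLines);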

If all you want is to determine if there are duplicates, and what those duplicates are:

var dupes =
    lines.GroupBy(l => l)
         .Select(g => new { Value = g.Key, Count = g.Count() })
         .Where(g => g.Count > 1);
foreach (var d in dupes)
{
    Console.WriteLine("'{0}' is a dupe ({1} occurrences).", d.Value, d.Count);
}

Or, if you just want to know if there are any duplicates:

if (dupes.Any())
    Console.WriteLine("There are duplicates!");
Jim Mischel