I want to write a program that measures the similarity between source code files, either as a percentage or at least as a "guess" at which files could have been copied. It will run on about 30 files of at most 500 lines each. The goal is to identify duplicate files, or the ones suspected of being duplicates.
I have run into several problems:
- Spacing: the same code can be written with different amounts of spaces or line breaks.
- Comments: one file may have comments while the other has none, or has different comments.
I thought I could solve these two problems by removing all spaces, line breaks, and comments from the code, but then I ran into the following:
- Files that try to "hide" the similarity. Consider the following two C files as an example:
Code 1:
void main()
{
    int x;
    int y;
    scanf("%d", &x);
    switch(x)
    {
    case 1:
        //some code
        break;
    case 2:
        //some code
        break;
    }
}
Code 2:
#define ONE 1
#define TWO 2
void main()
{
    int a, b;
    scanf("%d", &a);
    switch(a)
    {
    case ONE:
        //some code
        break;
    case TWO:
        //some code
        break;
    }
}
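For reference, here is a minimal sketch of what I have in mind for the first two problems (spacing and comments). The names normalize_c and similarity_percent are placeholders I made up, and the percentage formula, 2 * LCS / (len(a) + len(b)) over the normalized text, is only an assumption about how "similarity by percentage" could be measured. It does nothing about the renaming and #define tricks shown in the two files above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, dropping whitespace and C comments.
   String and character literals are copied verbatim, so a "//" inside a
   literal is not mistaken for a comment.
   dst must have room for strlen(src) + 1 bytes. Returns the output length. */
size_t normalize_c(const char *src, char *dst)
{
    size_t n = 0;
    while (*src) {
        if (src[0] == '/' && src[1] == '/') {            /* line comment */
            while (*src && *src != '\n') src++;
        } else if (src[0] == '/' && src[1] == '*') {     /* block comment */
            src += 2;
            while (*src && !(src[0] == '*' && src[1] == '/')) src++;
            if (*src) src += 2;                          /* skip closing marker */
        } else if (*src == '"' || *src == '\'') {        /* string or char literal */
            char quote = *src;
            dst[n++] = *src++;
            while (*src && *src != quote) {
                if (*src == '\\' && src[1]) dst[n++] = *src++;  /* keep escape */
                dst[n++] = *src++;
            }
            if (*src) dst[n++] = *src++;                 /* closing quote */
        } else if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r') {
            src++;                                       /* drop whitespace */
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}

/* Hypothetical score in percent, based on the longest common subsequence
   (LCS) of the two strings: 100 means identical, 0 means no character in
   common. Returns -1.0 if memory allocation fails. */
double similarity_percent(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la + lb == 0) return 100.0;                      /* two empty inputs */

    /* Two rows of the LCS table are enough; calloc zeroes them. */
    size_t *prev = calloc(lb + 1, sizeof *prev);
    size_t *cur = calloc(lb + 1, sizeof *cur);
    if (!prev || !cur) { free(prev); free(cur); return -1.0; }

    for (size_t i = 1; i <= la; i++) {
        for (size_t j = 1; j <= lb; j++) {
            if (a[i - 1] == b[j - 1])
                cur[j] = prev[j - 1] + 1;
            else
                cur[j] = prev[j] > cur[j - 1] ? prev[j] : cur[j - 1];
        }
        size_t *tmp = prev; prev = cur; cur = tmp;       /* finished row becomes prev */
    }
    size_t lcs = prev[lb];
    free(prev);
    free(cur);
    return 200.0 * (double)lcs / (double)(la + lb);
}
```

The idea would be to read each file into a buffer, call normalize_c on it, and pass each pair of normalized strings to similarity_percent; with 30 files that is 435 pairs. The LCS table costs time proportional to the product of the two lengths, which I assume is acceptable for files of this size.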
I would appreciate any help, whether a pointer to existing tools or a suggested algorithm.
Thanks.