
I have a huge tab-delimited BLASTn table report open in Notepad++. It contains duplicate records in its rows. I want to remove every row that contains one of these duplicate records, except for one occurrence, which would make going through the table much easier. How can I do this? The TextFX plugin only sorts the rows; am I missing something in it? The regex suggested removes the entire file contents.

  • Possible duplicate of [Removing duplicate rows in Notepad++](https://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad) – Toto Jul 20 '18 at 07:28
  • I saw that post, but none of the solutions there worked for me. – Mohamed Malash Jul 21 '18 at 14:09
  • Since the possible duplicate does not help you, please give a simple example (at most 20 lines) of the data you have and the output you want. The description in your question is not enough for us to understand why the duplicate is not helpful. – AdrianHHH Jul 21 '18 at 16:38
  • I posted a picture of an example of my data and marked on it the records whose duplicate rows I want to delete: https://i.stack.imgur.com/j7HR6.png It is also in the question above now. – Mohamed Malash Jul 23 '18 at 15:16
  • Please don't add an image; post an extract of your file and the expected result instead. – Toto Jul 23 '18 at 15:27
  • Ok, dear. Here it is https://pastebin.com/raw/rtiyvgB0 – Mohamed Malash Jul 24 '18 at 08:34

1 Answer


TextFX is a plugin that comes with 32-bit versions of Notepad++. That plugin has an option to remove duplicates.

Otherwise, you can use this regexp in the Replace dialog (Ctrl+H) to remove duplicates. Select the "Regular expression" search mode, remember to tick ". matches newline", and leave the "Replace with" field empty:

^([^\r\n]*)\r?\n(?=.*^\1(?:\r?\n|\z))

See: https://regex101.com/r/Imq3OZ/1/
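If the file is too large for the regexp to be practical, the same whole-line de-duplication can be done outside Notepad++. The following is only a minimal Python sketch, not part of the original answer: the function name `drop_duplicate_lines` and the sample rows are made up for illustration. Note that it keeps the first occurrence of each line, whereas the regexp above keeps the last one.

```python
def drop_duplicate_lines(lines):
    """Return the lines with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for line in lines:
        # Compare lines without their line ending so "\n" and "\r\n" rows match.
        key = line.rstrip("\r\n")
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# Hypothetical sample rows (tab-delimited), not taken from the real report.
sample = ["a\tb\n", "c\td\n", "a\tb\n"]
print(drop_duplicate_lines(sample))
```

To apply it to a real file, read the lines with `open(path).readlines()` and write the returned list back with `writelines()`.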

UPDATE

I have also added an option to filter rows based on part of a specific column (the third one in your case).

Try this: ^[^\t]*+\t[^\t]*+\t.{3}\|(NODE[^\t]*+)\t[^\n]*+\r?\n(?=[\s\S]*^[^\t]*\t[^\t]*\t.{3}\|\1\t)

Demo: https://regex101.com/r/xDLaS8/3/
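For comparison, here is a rough Python sketch of the same idea: treat two rows as duplicates when the part of the third column after the first "|" is identical. It is an approximation of the regexp, not a translation of it. It does not require the "NODE" prefix or the three-character prefix before "|", it keeps the first matching row rather than the last, and the function name `dedupe_by_column` and the sample rows are invented for illustration.

```python
def dedupe_by_column(lines, column=2):
    """Keep the first row for each distinct key found in the given column (0-based)."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= column:
            # Row has too few columns to hold a key: leave it untouched.
            kept.append(line)
            continue
        # Assumed key: the text after the first "|" in the column,
        # or the whole column when that text is empty or there is no "|".
        key = fields[column].partition("|")[2] or fields[column]
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept


# Hypothetical rows; the real BLASTn report may look different.
rows = [
    "q1\t10\tabc|NODE_1\t99\n",
    "q2\t20\txyz|NODE_1\t98\n",
    "q3\t30\tabc|NODE_2\t97\n",
]
print("".join(dedupe_by_column(rows)), end="")
```

Keying on a set makes this a single pass over the file, which tends to scale better on very large reports than a regexp lookahead that rescans the rest of the document for every row.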

Julio
  • I installed it, but it only sorts rows according to the first cell in the row, and there is no duplicate-removal option. The regex removed everything and the file became empty. When you click "Find All in Current Document", a subwindow opens underneath listing all rows that contain the duplicate record. Is there a way to remove all of these hits from that subwindow? – Mohamed Malash Jul 21 '18 at 14:16
  • I have uploaded an animated gif. Could you double check you are doing the same? Also, a sample text would be useful – Julio Jul 21 '18 at 19:56
  • It removed the whole document. Please look here for a sample of what I mean: https://i.stack.imgur.com/j7HR6.png – Mohamed Malash Jul 23 '18 at 15:17
  • Oops, that's a totally different thing. Those are not duplicated rows, since you are looking at a very specific part of a row and not the whole row. Could you please upload the file (as text, not as an image) somewhere? Here on Stack Overflow or on a site like www.pastebin.com – Julio Jul 23 '18 at 15:48
  • Is the column you are interested in always in the same position (third column)? – Julio Jul 24 '18 at 08:54
  • @MohamedMalash, I have updated my answer. Try with the new regexp – Julio Jul 24 '18 at 09:09
  • Thanks, dear. Yes, it is the 3rd column. This regex reduces the number of duplicates, but it does not remove all of them except one occurrence. Can you modify it a little to do that? – Mohamed Malash Jul 25 '18 at 20:08
  • @MohamedMalash I think it should be fixed now. Try the new regexp and let me know. – Julio Jul 25 '18 at 21:40
  • It works :) Thanks dear Julio for your patience and the great help. God bless you and wish you all the best. – Mohamed Malash Jul 26 '18 at 01:17