
What I need to do is remove duplicated pairs and their mirror pairs from an array. That is, for each pair that is counted twice, I want to remove the corresponding rows, as well as the row containing the mirror pair.

Example:

I have the matrix A:

A=array([[ 0, 55],
   [ 5, 25],
   [ 5, 25],
   [12, 62],
   [27, 32],
   [25, 73],
   [55,  0],
   [25,  5],
   [62, 12],
   [32, 27],
   [99, 95]])

So what I want to obtain is the following matrix B:

B = array([[25, 73],
           [99, 95]])

Here the pair [5, 25] (counted twice) and its mirror pair [25, 5] have been removed, along with [0, 55] and its mirror pair, as well as [12, 62] and [27, 32] and their mirrors.

Matteo
  • I forgot to mention it, but since the matrix will be very large, I need to do this operation as fast as possible. I found a way, but it employs several "for" loops and is very slow. – Matteo Apr 23 '21 at 09:34
  • Can you check your expected result again? How I understand it, [12, 62] and [27, 32] should not be in B since [62, 12] and [32, 27] are also in A. – pktl2k Apr 23 '21 at 13:00
  • You are right, I am going to replace it with the right matrix B – Matteo Apr 23 '21 at 16:12
  • is the dtype always int? – Pierre D Apr 23 '21 at 18:17
  • @Matteo check the solution and let me know if you are still finding any issues – Exploore X Apr 23 '21 at 18:19
  • @PierreD Yes the elements are int – Matteo Apr 23 '21 at 18:27
  • @ExplooreX Thank you for your answer. I have to test it and see how much time it takes for the matrix dimensions I deal with – Matteo Apr 23 '21 at 18:44

3 Answers


LOGIC:
1. The main idea is to check, for each row i, which rows of A it matches, either directly or as a mirror pair. NumPy's vectorized operations let us check both conditions at once:

cond_i = np.all(i == A, axis=1) | np.all(i[::-1] == A, axis=1)

2. If the number of True entries in cond_i is exactly 1, the row matches only itself, i.e., it has neither a duplicate nor a mirror pair elsewhere in A, so it belongs in the result. That check is:

np.sum(cond_i) == 1

3. Finally, append each row that satisfies the condition:

B.append(A[j])

CODE :

from numpy import array
import numpy as np
A = array([[0, 55], [5, 25], [5, 25], [12, 62], [27, 32], [25, 73], [55, 0],
           [25, 5], [62, 12], [32, 27], [99, 95]])

B = []
for j, i in enumerate(A):
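    # a row is kept iff no other row equals it, directly or reversed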
    if np.sum(np.all(i == A, axis=1) | np.all(i[::-1] == A, axis=1)) == 1:
        B.append((A[j]))

print(np.array(B))

OUTPUT :

[[25 73]
 [99 95]]
Exploore X
  • it is faulty and drops some unique rows. Here is an example: `A = np.array([[1, 1], [0, 0], [2, 2], [1, 0]])`. All these rows are unique, but the code returns `[[1, 1], [0, 0], [2, 2]]` and drops `[1, 0]`. – Pierre D Apr 23 '21 at 21:04
  • @PierreD I really thanks to pointing out the bug now please check it and let me know any other issue – Exploore X Apr 23 '21 at 21:25
  • you can use `for r in A`, then `if np.sum(np.all(r == A, axis=1) | np.all(r[::-1] == A, axis=1)): B.append(r)`. But in the end it's still `O(n^2)` (one `n` in the Python loop, the other for the `r` vs `A` operations). – Pierre D Apr 23 '21 at 21:27
  • @PierreD thanks for such a nice discussion and help me to figure out various bugs. Thanks a lot – Exploore X Apr 23 '21 at 21:42

For more than ~40 rows, these solutions (the numpy one for up to about 1000 rows, then the Pandas-based one) are the fastest so far.

Here is what I would do for a vectorized operation (fast, no Python loops):

import pandas as pd

def unique_pairs(a):
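    # normalize each row to (min, max), so a pair and its mirror become identical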
    df = pd.DataFrame({'x': a.min(axis=1), 'y': a.max(axis=1)})
    return a[~df.duplicated(keep=False)]

B = unique_pairs(A)

# on your example:
>>> B
array([[25, 73],
       [99, 95]])

If you are looking for a pure numpy solution (alas, as per the note below, it is slower for large arrays):

def np_unique_pairs(a):
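    # same (min, max) normalization; keep rows whose pair occurs exactly once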
    z = np.stack((a.min(axis=1), a.max(axis=1)))
    _, rix, cnt = np.unique(z, return_inverse=True, return_counts=True, axis=1)
    return a[(cnt==1)[rix]]

Performance

A = np.random.randint(0, 10, (1_000_000, 2))

%timeit unique_pairs(A)
# 45.6 ms ± 49.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Notes

  • np.unique(a, axis=0) is quite a bit slower than the Pandas family of duplicate-handling functions (drop_duplicates(), duplicated(), etc.). See numpy issue #11136.
  • there are other ways that could work, such as mapping each pair of numbers on a row onto a single integer. See this SO answer for some ideas; a minimal sketch follows.
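
For illustration, here is a minimal sketch of that encoding idea (my own addition, not from the answers above; it assumes non-negative integer entries whose encoded values fit in int64, and the function name is hypothetical):

import numpy as np

def encoded_unique_pairs(a):
    # map each unordered pair (x, y) to the single integer lo * k + hi,
    # where lo = min(x, y), hi = max(x, y), and k exceeds every value in a;
    # a row and its mirror then receive the same code
    lo = a.min(axis=1).astype(np.int64)
    hi = a.max(axis=1).astype(np.int64)
    k = np.int64(a.max()) + 1
    codes = lo * k + hi
    _, rix, cnt = np.unique(codes, return_inverse=True, return_counts=True)
    return a[(cnt == 1)[rix]]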

Speed comparison

Here is a comparison of the speed of 4 methods:

  • pd_unique_pairs() is the Pandas solution proposed above.
  • np_unique_pairs() is the pure Numpy solution proposed above.
  • counter_unique_pairs() is proposed by @Acccumulation and is based on the use of Python loops and Counter.
  • loop_unique_pairs() is proposed by @ExplooreX and is based on explicit Python loops.

[perfplot: run time of the four methods vs. number of rows]

Clearly, for more than 1000 rows, pd_unique_pairs dominates. Between roughly 40 and 1000 rows, np_unique_pairs wins. For very small arrays (under 40 rows), counter_unique_pairs is the most effective.

# additional code for the perfplot above
from collections import Counter
import perfplot

pd_unique_pairs = unique_pairs  # the Pandas solution defined above

def counter_unique_pairs(A):
    A_Counter = Counter((tuple(sorted(item)) for item in A))
    single_instances = [item for item in A_Counter if A_Counter[item]==1]
    B = np.array([item for item in A if tuple(sorted(item)) in single_instances])
    return B

def loop_unique_pairs(A):
    B = []
    for j, i in enumerate(A):
        cond_i = (i == A) | (i[::-1] == A)
        if sum(cond_i[:, 0] & cond_i[:, 1]) == 1:
            B.append(A[j])
    B = np.array(B)
    return B

perfplot.show(
    setup=lambda n: np.random.randint(0, np.sqrt(n).astype(int), (n, 2)),
    kernels=[pd_unique_pairs, np_unique_pairs, counter_unique_pairs, loop_unique_pairs],
    n_range=[2 ** k for k in range(3, 14)],
    equality_check=None,  # disabled since loop_ appears to be wrong sometimes
)
Pierre D

The best code that I was able to come up with is:

from collections import Counter

A_Counter = Counter(tuple(sorted(item)) for item in A)
single_instances = [item for item in A_Counter if A_Counter[item] == 1]

There's likely further optimization possible.

I found that, using your sample data, this took 52 microseconds, compared to 2574 microseconds for Pierre D's answer and 1024 microseconds for Exploore X's.

It does, however, return each item sorted and as a tuple. This can be corrected with B = [item for item in A if tuple(sorted(item)) in single_instances], which brings its time up to 88 microseconds.
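
Putting the pieces together, a runnable sketch of this approach could look as follows (the set for the single instances is my own tweak, giving O(1) membership tests instead of scanning a list):

from collections import Counter
import numpy as np

def counter_unique_pairs(A):
    # count each pair in an order-insensitive way
    counts = Counter(tuple(sorted(item)) for item in A)
    # pairs that occur exactly once; a set makes the membership
    # test below O(1) instead of O(len(singles))
    singles = {pair for pair, c in counts.items() if c == 1}
    return np.array([item for item in A if tuple(sorted(item)) in singles])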

Also, this code looks for duplicates based on any permutation of the elements, not just mirroring. For two-element lists, these two are equivalent, but if you want to expand it to data that has lists of more than two elements, and you want just mirrored lists, you'll have to adjust the code.

Acccumulation
  • It's brilliant. Unfortunately I can't try any code because I'm not home right now, but it seems to be very efficient, at least according to the timings you report. – Matteo Apr 23 '21 at 19:29
  • but this is only for a tiny array! Try on a 1000 row one... – Pierre D Apr 23 '21 at 19:31
  • a key quote is from the OP's comment on the question itself: "_since the dimension of the matrix would be very large I need to do this operation in the faster way possible_". In my book, a "very large matrix" is more than `(11,2)`. – Pierre D Apr 23 '21 at 19:33
  • ok, I added in my answer some comparative timing measurements of all solutions for a range of sizes. – Pierre D Apr 23 '21 at 20:35