
I'm trying to write a regex in Python with the following conditions:

  • The string must contain both "happy" and "#days"
  • The letters of these words can appear in any order

For example:

string = "welcome to our pahpy and great d#ays" has to be True, because the letters of each word appear (in scrambled order) within a word of the string.

I found a solution for finding the words, but not with the letters in random order.

I have to use regex because it's in my exercise instructions.

Thanks in advance!

Nicode.IO
  • Will you please provide an example string? – Achyut Vyas Sep 04 '20 at 15:00
  • I can't think of any solution other than searching all combinations of letters – Bharel Sep 04 '20 at 15:00
  • Can you clarify: Can the words, or the letters _within_ the words, appear in random order, or both? In case of the letters, do they still have to be "connected" (i.e. an anagram of the word) or can there be other letters in between? – tobias_k Sep 04 '20 at 15:00
  • Your 2nd condition just cancels the 1st condition, so you are basically just checking letters, not words – Sandrin Joy Sep 04 '20 at 15:01
  • @Bharel That would be a worst-case search. There are better ways to achieve this, I believe. You should be able to do it in `O(len(text)+len('happy'))` – Ehsan Sep 04 '20 at 15:07
  • @SandrinJoy I believe OP wants the permutation of letters to be next to each other, as in a substring. – Ehsan Sep 04 '20 at 15:08

2 Answers


I assume there can be regex solutions to this, but here is an O(len(txt)+len(word)) implementation using a prime-number hashmap for characters. It converts your word and txt to numbers and then looks for the word-specific number in txt (this code can be optimized in many ways):

import math
import numpy as np
from skimage.util import view_as_windows

txt = 'haypptod#ays'
word = 'happy'

# create a prime number hashmap for the chars in your string
num_char = len(set(txt))
prime = [2]
cnt, num = 1, 3
while cnt < num_char:
    # trial division by odd numbers up to sqrt(num); num is always odd here
    if all(num % i != 0 for i in range(3, int(math.sqrt(num)) + 1, 2)):
        prime += [num]
        cnt += 1
    num += 2  # only test odd candidates; stepping by 1 would wrongly accept evens like 4
char_hash = {k: v for k, v in zip(set(txt), prime)}

# convert your word and txt to numbers
word_p = np.prod([char_hash[i] for i in word])
# product of each sliding window of len(word) consecutive character codes in txt
txt_p = view_as_windows(np.array([char_hash[i] for i in txt]), len(word)).prod(1)
print(np.any(txt_p == word_p))
# True

You need to build the hashmap only once. You can repeat this for multiple words if your words have different lengths, or, since both words here have the same length, simply print((word1_p in txt_p) and (word2_p in txt_p)).
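For instance, a minimal sketch of that two-word check, reusing char_hash and txt_p from above (this assumes every character of both words occurs in txt, so char_hash covers them):

# both words have length 5, so the same window products txt_p work for both
word1_p = np.prod([char_hash[i] for i in 'happy'])
word2_p = np.prod([char_hash[i] for i in '#days'])
print((word1_p in txt_p) and (word2_p in txt_p))
# True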

Explanation:

  • Map every unique character to a unique prime number: char_hash. (For this you need enough prime numbers to cover all the letters of your text. There are many ways to do this, but since the alphabet is usually limited and small, you do not need to worry much about this step.)
  • Convert word and txt characters to prime numbers using char_hash
  • Calculate the product of word. (By unique factorization, this product is unique to this multiset of letters, so any permutation of it has the same product.)
  • Calculate the product of characters in txt over a moving window of the same size as your word (a skimage-free sketch of this windowing follows this list).
  • If any of those products in txt equals the word's value, you have found a permutation of your word in txt.
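If you would rather avoid the skimage dependency, here is a minimal sketch of the same moving-window product using only NumPy (assuming NumPy >= 1.20 for sliding_window_view):

import numpy as np

# same prime-product idea, but windowed with plain NumPy instead of skimage
codes = np.array([char_hash[i] for i in txt])
windows = np.lib.stride_tricks.sliding_window_view(codes, len(word))
txt_p = windows.prod(axis=1)  # one product per window of len(word) characters
print(np.any(txt_p == word_p))
# True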
Ehsan
  • Could you explain this a bit more, and possibly rewrite it so it does not need `skimage`, which does not really seem needed here? If I understand correctly, this converts each word to be looked for into a product of prime numbers, then tests whether any sliding window of the same length in the text has the same product, is that correct? – tobias_k Sep 04 '20 at 16:06
  • @tobias_k What you explained is correct. Any permutation of unique prime numbers has the same product value, and it is unique to that set of prime numbers. As for `view_as_windows`, you can replace it with NumPy strides or a moving window loop. I just find `view_as_windows` neater (subjective perspective). I will add more explanation. – Ehsan Sep 04 '20 at 16:10
  • This will not be the only O(n) approach, btw. You can translate this into counting, of course. Working with numbers is just simpler and cleaner in my opinion. – Ehsan Sep 04 '20 at 16:11
  • Upvoting because this does not have to rely on word boundaries; _however_, if OP actually wants to match whole words, this will be slower (as it matches more substrings) and might also yield false positives in case it matches a sub-word. Really depends on what OP really wants. – tobias_k Sep 04 '20 at 16:15
  • @tobias_k Thank you. Yes, it really boils down to what OP needs. Hopefully someone finds it useful. – Ehsan Sep 04 '20 at 16:19

I don't think regular expressions are the right approach for this, at least if I understand your question correctly. As I understand it, you want to check whether an anagram of each of the words appears in the text. For this, you should just find a "normalized" form for those words (e.g. the lower-cased letters in sorted order) and check whether they are all among the text's normalized words.

>>> text = "some text with the words sd#ay and phayp in it"
>>> words = "happy", "#days"
>>> norm = lambda s: ''.join(sorted(s.lower()))
>>> len(set(map(norm, text.split())) & set(map(norm, words))) == 2
True

This will normalize each word in the text and in the words list exactly once, which (when sorting) takes O(n log n) per word and could be reduced to O(n) (using character counts), followed by just a single set lookup per normalized word, as opposed to searching for all permutations of the words and of the characters within them.
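For example, a minimal sketch of the O(n) character-count variant; the helper name count_norm is just for illustration, using a frozenset of Counter items as a hashable normal form:

from collections import Counter

# O(len(s)) normal form: a hashable multiset of (character, count) pairs
def count_norm(s):
    return frozenset(Counter(s.lower()).items())

text = "some text with the words sd#ay and phayp in it"
words = "happy", "#days"
print(len(set(map(count_norm, text.split())) & set(map(count_norm, words))) == 2)
# True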

Of course, this assumes you want to match entire words, and not parts of words or e.g. DNA subsequences. You can (and probably should) use regular expressions instead of just split() to split the text into words, e.g. to take punctuation into account.
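A minimal sketch of that regex tokenization; the pattern [#\w]+ is just one assumption about what counts as a word character here, chosen so that "#days" survives as a single token:

import re

text = "some text, with the words sd#ay and phayp in it!"
# keep runs of word characters and '#', dropping punctuation and whitespace
tokens = re.findall(r"[#\w]+", text)
print(tokens)
# ['some', 'text', 'with', 'the', 'words', 'sd#ay', 'and', 'phayp', 'in', 'it']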

tobias_k
  • It is quite an assumption that the substrings have a split character dividing them into words. I think OP should clarify this, though. – Ehsan Sep 04 '20 at 15:19
  • Also, doesn't norm cost a lot to calculate on every word of a string? – Ehsan Sep 04 '20 at 15:20
  • @Ehsan (a) Right, this approach assumes that OP wants to match words, not parts of words or e.g. DNA subsequences; (b) no, normalizing a word, which is done exactly once per word, is O(n log n), and could be reduced to O(n) by using e.g. a char counter. – tobias_k Sep 04 '20 at 15:22
  • I like the char counter better. `O(n log n)` seems unnecessary. – Ehsan Sep 04 '20 at 15:24