-1

I have some text in a variable, raw_text, and I want to count the number of continuous number sequences, such as 124, using Python. How would I accomplish this?

In addition, is there an efficient way to calculate the frequencies of each number sequence?

Christopher Peisert
Shamoon

2 Answers

1

You could use a regular expression to match numeric sequences. The number of matches would be the count of continuous number sequences.

A collections.Counter would be a convenient way to get the frequencies of each match.

from collections import Counter
import re

raw_text = "blah123 hello9832 then32233 123"
matches = re.findall(r"\d+", raw_text)
print(f"found {len(matches)} number sequences")

counter = Counter(matches)
print(counter)

Output

found 4 number sequences
Counter({'123': 2, '9832': 1, '32233': 1})

To sort the results by frequency and break ties using the lexicographic ordering of the numeric sequences:

sorted_by_freq = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
print(sorted_by_freq)

Output

[('123', 2), ('32233', 1), ('9832', 1)]
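If ties should instead be broken by numeric value rather than string order, one possible variation is to convert each match with int() in the sort key. This is only a sketch reusing the sample raw_text above; the name sorted_numeric is made up for illustration.

```python
from collections import Counter
import re

raw_text = "blah123 hello9832 then32233 123"
counter = Counter(re.findall(r"\d+", raw_text))

# Sort by descending frequency, then by the integer value of the sequence
# (so '9832' comes before '32233', unlike the lexicographic ordering).
sorted_numeric = sorted(counter.items(), key=lambda item: (-item[1], int(item[0])))
print(sorted_numeric)
```

With the sample text this prints [('123', 2), ('9832', 1), ('32233', 1)].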
Christopher Peisert
1

You could write a tokenizer:

raw_text = "tunapro1234test123"


def tokenizer(text):
    i = 0
    numbers = []
    while i < len(text):

        if text[i].isdigit():
            # Start a new number and collect the whole run of digits.
            numbers.append("")

            while i < len(text) and text[i].isdigit():
                numbers[-1] += text[i]
                i += 1

        i += 1
    return numbers

numbers = tokenizer(raw_text)
number_sequences = len(numbers)
print(numbers, number_sequences, sep="\n")

The same approach written as a generator:

raw_text = "tunapro1234test123"

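def tokenizer_2(iterable):
    # One shared iterator, so the inner loop continues where the outer loop stopped.
    generator = iter(iterable)
    last_number = ""
    for char in generator:
        if char.isdigit():
            last_number += char

            # Consume the rest of the current run of digits.
            for char in generator:
                if not char.isdigit():
                    break
                last_number += char
            yield last_number
            last_number = ""

def count_number_sequences(raw_text):
    return len(list(tokenizer_2(raw_text)))

# Materialize the generator so the list can be both printed and counted.
numbers = list(tokenizer_2(raw_text))
number_sequences = len(numbers)
print(numbers, number_sequences, sep="\n")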

OUTPUT:

['1234', '123']
2

(Both versions produce the same output.)
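The question also asks about frequencies. As a rough sketch, not the tokenizer above but a swapped-in alternative, itertools.groupby can split the text into runs of digits and non-digits, and the resulting list can be passed to collections.Counter. The helper name digit_runs is hypothetical.

```python
from collections import Counter
from itertools import groupby

raw_text = "tunapro1234test123"

def digit_runs(text):
    # groupby yields consecutive runs of characters that share the same
    # str.isdigit result; only the runs made of digits are kept.
    return ["".join(group) for is_digit, group in groupby(text, key=str.isdigit) if is_digit]

numbers = digit_runs(raw_text)
frequencies = Counter(numbers)
print(numbers, len(numbers), frequencies, sep="\n")
```

Here numbers is again ['1234', '123'], and frequencies maps each sequence to the number of times it occurs.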

TUNAPRO1234