
How can I capture classes and their methods from a Python file?

I don't care about the attributes or arguments.

class MyClass_1(...):
    ...
    def method1_of_first_class(self):
        ...

    def method2_of_first_class(self):
        ...

    def method3_of_first_class(self):
        ...

class MyClass_2(...):
    ...
    def method1_of_second_class(self):
        ...

    def method2_of_second_class(self):
        ...

    def method3_of_second_class(self):
        ...

What I tried so far:

class ([\w_]+?)\(.*?\):.*?(?:def ([\w_]+?)\(self.*?\):.*?)+?

Options: dot matches newline

CAPTURING THE CLASS

Match the characters “class ” literally «class »
Match the regular expression below and capture its match into backreference number 1 «([\w_]+?)»
   Match a single character present in the list below «[\w_]+?»
      Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
      A word character (letters, digits, etc.) «\w»
      The character “_” «_»
Match the character “(” literally «\(»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “)” literally «\)»
Match the character “:” literally «:»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»

CAPTURING THE METHODS:

Match the regular expression below «(?:def ([\w_]+?)\(self.*?\):.*?)+?»
   Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
   Match the characters “def ” literally «def »
   Match the regular expression below and capture its match into backreference number 2 «([\w_]+?)»
      Match a single character present in the list below «[\w_]+?»
         Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
         A word character (letters, digits, etc.) «\w»
         The character “_” «_»
   Match the character “(” literally «\(»
   Match the characters “self” literally «self»
   Match any single character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “)” literally «\)»
   Match the character “:” literally «:»
   Match any single character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»

But it only captures the class name and the first method. I think that's because backreference number 2 can't hold more than one capture, even though it's inside a (?:myregex)+? group.
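A quick check with Python's re module is consistent with that suspicion. This is a minimal sketch on a made-up input string (not the pattern above): a capturing group inside a repeated group keeps only its last iteration, whereas re.findall returns every hit.

```python
import re

# Hypothetical input, chosen only to show how a repeated group behaves.
m = re.match(r"(?:(\w+),)+", "a,b,c,")

# The inner group matched three times, but only the last capture is retained.
last_capture = m.group(1)
print(last_capture)

# re.findall applies the pattern repeatedly instead, so every hit is returned.
all_captures = re.findall(r"(\w+),", "a,b,c,")
print(all_captures)
```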

Current Output:

'MyClass_1':'method1_of_first_class',
'MyClass_2':'method1_of_second_class'

Desired Output:

'MyClass_1':['method1_of_first_class','method2_of_first_class',...],
'MyClass_2':['method1_of_second_class','method2_of_second_class',...]
f.rodrigues
  • What's your expected output? – Avinash Raj Dec 09 '14 at 08:43
  • `[MyClass_1, [method1_of_first_class,method2_of_first_class,...]]` `[MyClass_2, [method1_of_second_class,method2_of_second_class,...]]` – f.rodrigues Dec 09 '14 at 08:45
  • Parsing code with regex is **hard**. See [1](http://stackoverflow.com/a/27149898/), [2](http://stackoverflow.com/a/17134110), [3](http://stackoverflow.com/a/21395083). I would suggest using a dedicated parser. Also, when asking regex questions, please state the language/tool you're using. – HamZa Dec 09 '14 at 08:51
  • https://docs.python.org/2/library/ast.html#module-ast – nhahtdh Dec 09 '14 at 09:15

2 Answers


Since a class can contain another class or function, and a function can contain another function or class, simply grabbing the class and function declarations with a regex loses the hierarchy information.

In particular, pydoc.py (which is available from version 2.1) in your Python installation is a prime example of such cases.

Parsing Python code in Python is simple, since the standard library ships a parser: the parser module and, from version 2.6, the ast module. (The parser module was deprecated in Python 3.9 and removed in 3.10, so ast is the one to use today.)

Here is sample code that parses Python source with the ast module (version 2.6 and above):

from ast import AST, NodeVisitor, iter_fields, parse
import sys

# Read the source file named on the command line.
with open(sys.argv[1]) as fi:
    source = fi.read()

parse_tree = parse(source)

class Node:
    def __init__(self, node, children):
        self.node = node
        self.children = children

    def __repr__(self):
        return "{{{}: {}}}".format(self.node, self.children)

class ClassVisitor(NodeVisitor):
    def visit_ClassDef(self, node):
        # print(node, node.name)

        r = self.generic_visit(node)
        return Node(("class", node.name), r)

    def visit_FunctionDef(self, node):
        # print(node, node.name)

        r = self.generic_visit(node)
        return Node(("function", node.name), r)


    def generic_visit(self, node):
        """Called if no explicit visitor function exists for a node."""
        node_list = []

        def add_child(nl, children):
            if children is None:
                pass
            # Remove the next 2 lines (the list branch) if you need more scoping information.
            elif isinstance(children, list):
                nl += children
            else:
                nl.append(children)

        for field, value in iter_fields(node):
            if isinstance(value, list):
                for item in value:
                    if isinstance(item, AST):
                        add_child(node_list, self.visit(item))
            elif isinstance(value, AST):
                add_child(node_list, self.visit(value))

        return node_list if node_list else None

print(ClassVisitor().visit(parse_tree))

The code has been tested in Python 2.7 and Python 3.2.

Since the default implementation of generic_visit doesn't return anything, I copied the source of generic_visit and modified it to pass the return value back to the caller.
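If only the flat class-to-methods mapping from the question is needed, a shorter route is possible. The following is a minimal sketch, not part of the code above: the helper name `methods_by_class` and the `sample` string are made up for illustration, it assumes Python 3.5+ (for `ast.AsyncFunctionDef`), and it lists only functions defined directly in a class body. Two classes with the same name would overwrite each other.

```python
import ast

def methods_by_class(source):
    """Map each class name to the functions defined directly in its body."""
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            result[node.name] = [
                child.name
                for child in node.body
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
    return result

# Hypothetical input modelled on the question's example.
sample = '''
class MyClass_1(object):
    def method1_of_first_class(self):
        pass

    def method2_of_first_class(self):
        pass

class MyClass_2(object):
    def method1_of_second_class(self):
        pass
'''

print(methods_by_class(sample))
```

Because `ast.walk` visits every node, nested classes get their own entry as well; the hierarchy itself is not preserved in this flat form.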

nhahtdh

You could use this regex to start with:

/class\s(\w+)|def\s(\w+)/gm

This will match all class and method names. To get them into the structure you mentioned in your comments, you'll need to post-process the matches in a programming language.

Edit: here's a PHP implementation example:

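// Assumption: 'example.py' is a placeholder path for the Python file to scan.
$source = file_get_contents('example.py');
preg_match_all('/class\s(\w+)|def\s(\w+)/m', $source, $match_array);

$output = array();
$parent_key = null;

foreach ($match_array[0] as $value) {
    if (substr($value, 0, 5) === 'class') {
        // A new class starts: open a fresh list for its methods.
        $output[$value] = array();
        $parent_key = $value;
        continue;
    }
    // Skip functions that appear before any class.
    if ($parent_key !== null) {
        $output[$parent_key][] = $value;
    }
}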

// print_r($output);

foreach ($output as $parent => $values) {
    echo '[' . $parent . ', [' . implode(',', $values) . ']]' . PHP_EOL;
}

Example output:

[class MyClass_1, [def method1_of_first_class,def method2_of_first_class,def method3_of_first_class]]
[class MyClass_2, [def method1_of_second_class,def method2_of_second_class,def method3_of_second_class]]
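Since the file being scanned is Python, the same grouping can be sketched in Python too. This is an illustration under assumptions, not a tested tool: the names `PATTERN`, `group_names` and `sample` are made up, and unlike the PHP version it stores only the captured names, without the `class `/`def ` prefixes.

```python
import re

# Same alternation as the regex above; group 1 is a class name, group 2 a function name.
PATTERN = re.compile(r"class\s(\w+)|def\s(\w+)")

def group_names(source):
    """Attach each 'def' name to the most recent 'class' name seen before it."""
    grouped = {}
    current = None
    for match in PATTERN.finditer(source):
        class_name, func_name = match.groups()
        if class_name is not None:
            current = class_name
            grouped[current] = []
        elif current is not None:
            # Functions that appear before any class are skipped.
            grouped[current].append(func_name)
    return grouped

# Hypothetical input modelled on the question's example.
sample = (
    "class MyClass_1(object):\n"
    "    def method1_of_first_class(self):\n"
    "        pass\n"
    "    def method2_of_first_class(self):\n"
    "        pass\n"
    "class MyClass_2(object):\n"
    "    def method1_of_second_class(self):\n"
    "        pass\n"
)

print(group_names(sample))
```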
scrowler
  • It's an example only. Up to you and what language you are using to implement this. – scrowler Dec 09 '14 at 09:19
  • @f.rodrigues, just be aware that this solution will not work for input that contains a string with the word `class` or `def` in it, e.g. `""" a class that does something """` would find a class called `that`. A more robust solution would be nhahtdh's suggestion. – Bart Kiers Dec 09 '14 at 11:58
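
The false positive described in the last comment is easy to reproduce. A minimal sketch on a made-up snippet (the class name `Real` is hypothetical): the regex reports the docstring word as a class, while the ast module sees only the real definition.

```python
import ast
import re

# Hypothetical input: a real class whose docstring contains the word "class".
source = '''
class Real(object):
    """ a class that does something """
    def method(self):
        pass
'''

# The regex matches inside the docstring as well as the real declaration.
regex_classes = [m.group(1) for m in re.finditer(r"class\s(\w+)", source)]

# The parser only reports actual class definitions.
ast_classes = [
    node.name for node in ast.walk(ast.parse(source))
    if isinstance(node, ast.ClassDef)
]

print(regex_classes)
print(ast_classes)
```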