
I'm trying to search for and capture certain kinds of file names (e.g. /app.css, /main.js) inside another file (a log file).

The regex I've constructed is this:

^\/([a-zA-Z0-9_-]+)[.](css|js)

I'm trying to get the first capture group, i.e. the file name without its extension (app, main, etc. in the example above), and this is how I'm searching:

haystack = '/main.js'
matches = re.finditer(pattern, haystack, re.MULTILINE)

It works fine and I'm able to get the captured groups. However, if I do the same while reading a file, it doesn't work:

pattern = r"'^\/([a-zA-Z0-9_-]+)[.](css|js)'"
for i, line in enumerate(open('log.txt', 'r')):
    haystack = line.rstrip()
    matches = re.finditer(pattern, haystack, re.MULTILINE)

The content of log.txt is something like this:

duis ut diam quam /app.css porttitor
app.css
main.js
purus sit (amet volutpat /main.js)

It doesn't match on any line of the above file, even though it should match on all four lines!

Termin4t0r
    You specified the `^` anchor, which means "start matching at the beginning of the text", so the result you are getting is correct: there is no match. – Giacomo Alzetta Apr 05 '19 at 07:05

2 Answers


Change your regex to:

/([a-zA-Z0-9_-]+)\.(css|js)

demo: https://regex101.com/r/Aub4dw/1/

You do not need the beginning-of-line anchor (^). It works with haystack = '/main.js' only because /main.js sits at the very beginning of the string.
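As a rough sketch of how the file-reading loop might look without the anchor: the list `log_lines` below is a hypothetical stand-in for iterating over open('log.txt'), and `find_names` is a helper name made up for illustration. Note also that the pattern in the question's file-reading snippet contains literal single quotes inside the raw string (r"'^...'"), which would prevent a match on their own, so they are left out here.

```python
import re

# Unanchored pattern, with no stray quotes inside the string literal.
pattern = re.compile(r"/([a-zA-Z0-9_-]+)\.(css|js)")

# Stand-in for the lines of log.txt (sample content taken from the question).
log_lines = [
    "duis ut diam quam /app.css porttitor\n",
    "app.css\n",
    "main.js\n",
    "purus sit (amet volutpat /main.js)\n",
]

def find_names(lines):
    """Return (line_number, name, extension) for every match in the lines."""
    found = []
    for i, line in enumerate(lines, start=1):
        for m in pattern.finditer(line.rstrip()):
            found.append((i, m.group(1), m.group(2)))
    return found

print(find_names(log_lines))  # [(1, 'app', 'css'), (4, 'main', 'js')]
```

Only lines 1 and 4 match, because the pattern requires a leading /, and lines 2 and 3 (app.css, main.js) do not contain one.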

Allan

The contents of your file show that the file names do not always start at the beginning of a line, so you need to drop the ^ from the regex so that it can match anywhere in a line. You can use this regex:

/([a-zA-Z0-9_-]+)[.](css|js)

As you can see, in Python you don't need to escape / as \/, because Python regexes have no delimiters, unlike regex literals in languages such as JS and PHP, where / is the usual delimiter.
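A quick check of that point, as a minimal sketch: the sample string is taken from the question's log content, and the variable names are only illustrative. Both spellings of the pattern give the same groups.

```python
import re

text = "purus sit (amet volutpat /main.js)"

# Escaped slash, as in the original pattern (harmless but unnecessary in Python).
escaped = re.search(r"\/([a-zA-Z0-9_-]+)[.](css|js)", text)

# Plain slash, as suggested above.
plain = re.search(r"/([a-zA-Z0-9_-]+)[.](css|js)", text)

print(escaped.groups())  # ('main', 'js')
print(plain.groups())    # ('main', 'js')
```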

Also, if you want all the file names without their extensions, you can use findall instead of iterating over finditer (if that suits you better) and turn (css|js) into a non-capturing group, (?:css|js), like this:

import re

s = '''duis ut diam quam /app.css porttitor
app.css
main.js
purus sit (amet volutpat /main.js)'''

print(re.findall(r'/([a-zA-Z0-9_-]+)[.](?:css|js)', s))

This prints:

['app', 'main']


Pushpesh Kumar Rajwanshi