import urllib2
import re
import csv
from bs4 import BeautifulSoup

def get_BlahBlah(num1, num2, num3, num4):
    url1 = "http://BlahBlah.com/person_profile/"
    url2 = "?-id="
    url3 = "."
    url4 = "&source=personalranking="
    urlComplete = url1 + str(num1) + url2 + str(num2) + url3 + str(num3) + url4 + str(num4) 
    page = urllib2.urlopen(urlComplete)
    soup_BlahBlah = BeautifulSoup(page, "lxml")
    page.close()

    rank_tag = soup_BlahBlah.find('h1', class_="personal_rank") 

    if rank_tag:
        rank_string = rank_tag.span.string
        return rank_string

# Open the output file once instead of reopening it for every record.
saveFile = open('BlahBlah.csv', 'a')

# range() excludes its upper bound, so each range below covers exactly one value.
for num1_count in range(28343512, 28343513):
    for num2_count in range(9999888888, 9999888889):
        for num3_count in range(7777, 7778):
            for num4_count in range(0, 1):
                # The for loops advance the counters themselves; no manual "+= 1" is needed.
                record = get_BlahBlah(num1_count, num2_count, num3_count, num4_count)
                saveFile.write(str(record) + '\n')

saveFile.close()

The code above works, but I want to make it better and more efficient for my needs. I am trying to crawl the site and extract the "rank" information (the h1 tag with class "personal_rank") for each unique individual, and I want to cover every person on the entire site.

The site's URL structure is composed of various static and varying (numeric) parts, for example:

http://BlahBlah.com/person_profile/XXXXXXXX?-id=XXXXXXXXXX.XXXX&source=personalranking=X (note that this is not the site I want to crawl; it is only an example)

Here X can be any digit from 0 to 9. I have three questions:

  • Suppose all the numeric portions of the URL together are unique to a single person, so I have to cycle through multiple loops as in my current code. Is there a more efficient way to do this than four nested loops, which I find very time-consuming?

  • Now suppose only num1_count is unique to a single person, and the num2_count, num3_count, and num4_count portions can be any combination (as long as each keeps the same number of digits) and the URL will still refer to the same person (see the examples below). How can I use regex to replace my current code? And if I use regex to represent parts of the URL, how can I combine it with loops?

1) http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1 refers to Peter Pan
2) http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1 also refers to Peter Pan
3) http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1 refers to Robin King

  • Following up on the second question, suppose the number of digits in num1_count through num3_count matters, but the last numeric portion can be either one or two digits and will still refer to the same person. How can I code that?

Thanks in advance.

KubiK888
  • The purpose of regex is the opposite of what you're trying to do: a regex detects (and replaces or parses into variables) those numeric sections in an existing URL, but it won't expand a single pattern into a large number of URLs in the first place. For that, you still need your loops. If four variables identify each person, you must brute-force search with nested loops over all four, which will take a very long time. If you have access to a list of users, loop over that instead so you don't try the non-existent combinations. – Dewi Morgan Aug 24 '14 at 20:53
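
The division of labour described in this comment might look roughly like the sketch below: a regex takes an existing URL apart, while a loop builds candidate URLs. It is only an illustration against the made-up BlahBlah.com layout from the question. The names PROFILE_URL_RE, parse_profile_url, build_url, and candidate_urls are hypothetical, and zero-padding each part to a fixed width is an assumption based on the XXXXXXXX pattern shown in the question. itertools.product replaces the four nested loops with one flat loop; it does not reduce the number of requests.

```python
import re
from itertools import product

# Regex for the example URL layout from the question (not a real site).
# num4 accepts one or two digits, as in the third question.
PROFILE_URL_RE = re.compile(
    r"^http://BlahBlah\.com/person_profile/(?P<num1>\d{8})"
    r"\?-id=(?P<num2>\d{10})\.(?P<num3>\d{4})"
    r"&source=personalranking=(?P<num4>\d{1,2})$"
)


def parse_profile_url(url):
    """Return the numeric parts of a profile URL as a dict, or None if no match."""
    match = PROFILE_URL_RE.match(url)
    return match.groupdict() if match else None


def build_url(num1, num2, num3, num4):
    """Build one URL; zero-padding to 8/10/4 digits is an assumption."""
    return ("http://BlahBlah.com/person_profile/{:08d}?-id={:010d}.{:04d}"
            "&source=personalranking={}").format(num1, num2, num3, num4)


def candidate_urls(num1_values, num2_values, num3_values, num4_values):
    """Yield URLs lazily for every combination, using one flat loop."""
    for parts in product(num1_values, num2_values, num3_values, num4_values):
        yield build_url(*parts)


# The three example URLs from the question.
example_urls = [
    "http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1",
    "http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1",
    "http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1",
]

# If only num1 identifies a person, parsing it out removes duplicates.
people = {parse_profile_url(url)["num1"] for url in example_urls}

# In that case, loop over num1 only and pin the other parts to one fixed value.
one_url_per_person = list(
    candidate_urls(range(12345670, 12345672), [1111111111], [1111], [1])
)
```

Here people ends up holding two IDs, because the first two URLs share the same num1. If a list of real user IDs is available, passing it as num1_values avoids requesting combinations that do not exist, which is the point the comment makes.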
