
What I want to do is store the top 500 sites listed on alexa.com into a .txt file.
Here's how the program works.

When my .NET WebBrowser control visits one of Alexa's pages, it stores all the links in an HtmlElementCollection.
Then I loop over the collection to check whether each link's text contains a "."
If it does, it stores that text in a .txt file.

The problem is that String.Contains() throws an exception, so I can't filter the links and end up storing useless information as well.
Why doesn't String.Contains() work?

Error message: Object reference not set to an instance of an object.

Important parts

Robot.cs

public HtmlElementCollection page_elements
{
    get;
    set;
}

public void exec_task()
{
    var url_to_txtfile = new StreamWriter("urls.txt", true);

    foreach (HtmlElement element in page_elements)
    {
        string element_text = element.InnerText;
        if (element_text.Contains(".")) // Object reference not set to an instance of an object.
            url_to_txtfile.WriteLine(element_text);
    }

    url_to_txtfile.Close();

    next_page();
}

Form1.cs

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    Bot.page_elements = webBrowser1.Document.GetElementsByTagName("a");
    Bot.pages_visited++;

    if (Bot.pages_visited <= Bot.pages_to_visit)
    {
        Bot.exec_task();
        webBrowser1.Url = new Uri(Bot.url);
    }

}

Source Code

Robot.cs

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace AlexaBot
{
    class Robot
    {

        public Robot(string link, byte pages, byte items)
        {
            url = link;

            pages_to_visit = pages;
            link_per_page = items;

            pages_visited = -1;
        }

        public byte pages_to_visit
        {
            get;
            set;
        }

        private byte link_per_page
        {
            get;
            set;
        }

        public sbyte pages_visited
        {
            get;
            set;
        }

        public string url
        {
            get;
            set;
        }

        public HtmlElementCollection page_elements
        {
            get;
            set;
        }

        public void exec_task()
        {
            var url_to_txtfile = new StreamWriter("urls.txt", true);

            foreach (HtmlElement element in page_elements)
            {
                string element_text = element.InnerText;
                if (element_text.Contains("."))
                    url_to_txtfile.WriteLine(element_text);
            }

            url_to_txtfile.Close();

            next_page();
        }

        private void next_page()
        {
            if (pages_visited < 11)
                url = url.Remove(url.Length - 1) + pages_visited.ToString();
            else
                url = url.Remove(url.Length - 2) + pages_visited.ToString();
        }
    }
}

Form1.cs

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace AlexaBot
{
    public partial class Form1 : Form
    {
        Robot Bot;

        public Form1()
        {
            InitializeComponent();
            Bot = new Robot("http://www.alexa.com/topsites/global;0", 20, 25);
            webBrowser1.Url = new Uri(Bot.url);
        }

        private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            Bot.page_elements = webBrowser1.Document.GetElementsByTagName("a");
            Bot.pages_visited++;

            if (Bot.pages_visited <= Bot.pages_to_visit)
            {
                Bot.exec_task();
                webBrowser1.Url = new Uri(Bot.url);
            }

        }
    }
}
Ian Nelson

1 Answer

string.Contains() works just fine. The error is telling you that your object is null, and you can't dereference a null object. So in this line:

if (element_text.Contains("."))

clearly element_text is null. You should wrap it in a null check, maybe something as simple as this:

if (!string.IsNullOrWhiteSpace(element_text))
    if (element_text.Contains("."))

(Or, for older versions of .NET, use string.IsNullOrEmpty() instead.)

page_elements probably contains a lot of HTML elements, not all of which have an InnerText value. For those that don't, InnerText will be null, and calling Contains() on it throws the exception you're seeing. There's probably additional filtering you can do here to narrow your search, including using a more mature DOM parser (as mentioned in a comment on the question).
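
To keep the null check in one place, the filter could be pulled into a small helper. This is only a sketch: LinkFilter, LooksLikeSite and KeepSites are hypothetical names that are not in the question's code, and it works on plain strings so it can be tried without a WebBrowser control.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, not part of the original Robot.cs.
static class LinkFilter
{
    // True only when the text is non-null, non-blank and contains a dot.
    // IsNullOrWhiteSpace runs first, so Contains is never called on null.
    public static bool LooksLikeSite(string text)
    {
        return !string.IsNullOrWhiteSpace(text) && text.Contains(".");
    }

    // Returns the trimmed link texts that pass the filter, in input order.
    public static List<string> KeepSites(IEnumerable<string> linkTexts)
    {
        var kept = new List<string>();
        foreach (string text in linkTexts)
        {
            if (LooksLikeSite(text))
                kept.Add(text.Trim());
        }
        return kept;
    }
}
```

Inside exec_task() the condition would then read if (LinkFilter.LooksLikeSite(element.InnerText)), assuming the rest of the loop stays as it is.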

David