The PODA Blog

News, views and articles from our membership

Excel looks for a job

Posted by Brian Mulder on September 6th, 2008

Wanted to post for a very long time (sorry Anne) but made no time earlier on to do it.

Recently i did quite some web scraping among other interesting coding. Those little projects are always a dilemma between the ease of using the Excel built-in tools, like web queries with all its natural scrape settings. Or we need to achieve something that is little more complex. Results depend on user interaction and manipulating buttons and options is easier through the HTMLdocument object.

If we choose to manipulate a browser document through code we need to automate the internet explorer web browser object. For debugging you can use early binding but since this is a simple example i’ll refrain from that.

The HTMLdocument object is the page loaded inside the container of the browser object. If you create a reference to the document you have access to most of the items on that page. Typing text in boxes and clicking buttons to direct forms to the server becomes a breeze. I skip the little detail of sifting through the bare HTML to find the names of the objects on the page and getting this automation to work. If you’re interested in the details leave a comment and try to answer.

Enough talking let’s do something fun!

Since i’m always looking for work let’s Excel find us a job.

We can use any site but settle for something i’ve already done, to prevent reinventing the wheel.
We use the UK job center site, with a friendly user interface, if you spend some time getting it to work.

The problems i faced were:

1. these are pages probably build with Apache Struts and passing parameters in the URL wasn’t available (at least not as i could find out in the short time available)
2. had to throw out the idea of directly using web queries against the page
3. needed to re-query the site again for job details
4. these pages are trickier to scrape than simple HTML or ASP(X) pages
5. something that kept me busy for a day Windows Vista and IE 7

This little block creates a browser object in memory and navigates to the page we need:


Set ie = CreateObject(”InternetExplorer.Application”)
With ie
.navigate “http://www.washijb.jobcentreplus.gov.uk/Internet/setLocale.do?country=GB&language=en&page=/initialise.do”
.Visible = True ‘ for debug set to true
End With

Wait until the page loads:


Do
DoEvents
Loop While ie.readyState 4

With the start page in front of us, we need to tell it what kind of industry we’re interested in. Let’s take a logical one, the IT group:


Set inp = ie.document.all.Item(”label_179″)
If Not (inp Is Nothing) Then
inp.Checked = True
Set inp = ie.document.all.Item(”Function”)
inp.Item(2).Click
End If

The code looks for the correct checkbox (179) and sets it to true. To continue We need to click one of the next buttons and click it to make things work for us. We face with another 3 pages of choices more or less the same as this first one. So i’ll continue with the last page the results!

If you thought “Wow great stuff!”. Let me help you of the dream, this is only 25% of the work, but probably you were really not that impressed. Our interest is in the job details not plain listings and that’s what we got now. Depending on your choices on page 4 (location and type of job) you end up with more than 100 listings of which only 100 are available. Good to see the government thinking for you, maybe i’m interested in job 133, well bad luck.

We got our listing and this page provides us with the references to the job details but we’re still dealing with a browser object and we need our details in Excel. So we save the page to disk:


FileName = ThisWorkbook.Path & “\JCPResults1.html”
sBody = ie.document.body.outerHTML
‘ write to file
Open FileName For Output As #3
Print #3, “” & sBody & “”
Close #3

That’s better, we got that part done and at this point the web query comes in, we simply query the file on disk to get the listings and load it into a worksheet. I use a template function for these queries but here’s the relevant part for the story:


With mysh.QueryTables.Add(Connection:= _
“URL;” & FileName, Destination:=mysh.Range(”A1″))
.WebFormatting = xlWebFormattingAll ‘ need html for links
.WebSelectionType = xlEntirePage
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
DoEvents
.Refresh BackgroundQuery:=False ‘ wait for data to arrive
End With

Great now we’re halfway!

We need the reference code for the job and this comes from cell “A” on our listing sheet and we call it RefCode. Since we loop through the jobs we can use this loop number to add the second parameter to the detail page url.

And repeat the first part of this post without the interaction with the browser object. We just retrieve the page, save it to disk and query it from excel with web query:


URL = “http://www.washijb.jobcentreplus.gov.uk/Internet/viewVacancy.do?selectedVacancy=” & JobNum - 2 & “&ref=” & RefCode

With ie
.navigate URL
End With

Do
DoEvents
Loop While ie.readyState 4

FileName = ThisWorkbook.Path & “\JobDetail.html”
sBody = ie.document.body.outerHTML
‘ write to file
Open FileName For Output As #3
Print #3, “” & sBody & “”
Close #3

Why would we do it like this? And not direct through a web query? Go ahead and show me a reliably working solution don’t think i didn’t try. The solution above works and my deadline couldn’t be pushed outward. Besides i don’t like to spend more than a few hours on a problem (something i regrettably did, read on).

OK, now we have listings and details and hey we’re done!

Excel delivers us interesting job opportunities in 5 minutes without going through the chore of browsing and setting the preferences. Wait for the results to load and go from detail pages back and forth.

Here is a screenshot of my endresult.

Jobsheet

I don’t post a working sample, you can always mail or comment for help on getting a scrape to work. I cannot guarantee a working solution but i’m always willing to think along for 5 minutes ;-)

Oh, i forgot about problem 5 mentioned above. That was something that kept me busy for 2 days. Since i didn’t have a machine to test this and couldn’t ask the client to test, i had to do it with a non-coder who had a Vista + IE 7 system and e-mail. Tracked it down to IE 7 security settings in Windows Vista, you need to add the site to your trusted site list within IE. Another possible solution that i didn’t have to try is hosting the browser control on an Excel form.

Happy coding, and till next time looking up where somebody lives in terms of the Postal service?

Leave a Reply

You must be logged in to post a comment.