posted: 7/27/08 filed under: ColdFusion


Recently, I have been working on an internal tool for our sales and marketing departments. They wanted the ability to provide the URL of a company and return various information. One of the elements they wanted was the number of pages indexed in Google.

This was my first attempt at combining regular expressions with cfhttp. It’s pretty handy once you get the hang of it, and it can be a very powerful combination. Here’s the script I use to scrape the total number of pages indexed in Google.

<!---  We only need the domain name without the http://www. --->
<cfset siteurl = "jasonbartholme.com">
 
<!---  Regular expression which matches the pattern to determine count --->
<cfset googleregex = '<font size=-1>Results [\s\S]*? of about <b>([\s\S]*?)</b> for'>
 
<!--- useragent is required because the page will not "get" properly otherwise --->
<cfhttp url="http://www.google.com/search?q=site%3A#siteurl#"
		method="get"
		resolveurl="false"
		useragent="#cgi.http_user_agent#">
</cfhttp>
 
<!---  Trim the whitespace in the content, and check for our regex pattern --->
<cfset sdoc = trim(cfhttp.filecontent)>
<cfset result = refindnocase(googleregex,sdoc,1,"true")>
 
<!---  cftry/cfcatch to see if refindnocase() returned a result --->
<cftry>
	<cfset resultcount = replace(mid(sdoc,result.pos[2],result.len[2]),',','','ALL')>
<cfcatch type="any">
	<cfset resultcount = 0>
</cfcatch>
</cftry>
 
<!---  display our result --->
Pages indexed: <cfoutput>#resultcount#</cfoutput>
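
The pattern can be checked without calling Google at all. Below is a minimal sketch that runs the same regular expression against a hard-coded string. The `sample` markup and its numbers are made up for illustration (they only mimic the results line the pattern expects), and the variable names `sample`, `pattern`, `match` and `count` are not part of the script above. It also swaps the cftry/cfcatch for an explicit check on the arrays that refindnocase() returns, which is just one possible way to handle a failed match.

```cfml
<!--- Hypothetical sample of the results line; not real output from Google --->
<cfset sample = '<font size=-1>Results <b>1</b> - <b>10</b> of about <b>1,230</b> for <b>site:example.com</b></font>'>
<cfset pattern = '<font size=-1>Results [\s\S]*? of about <b>([\s\S]*?)</b> for'>

<!--- With returnsubexpressions set to true, refindnocase() returns a struct holding pos and len arrays --->
<cfset match = refindnocase(pattern, sample, 1, true)>

<!--- pos[1] is 0 when nothing matched, so check before reading the captured group at index 2 --->
<cfif arraylen(match.pos) gte 2 and match.pos[1] gt 0>
	<cfset count = replace(mid(sample, match.pos[2], match.len[2]), ",", "", "all")>
<cfelse>
	<cfset count = 0>
</cfif>

<!--- display the captured count with the commas removed --->
Pages indexed: <cfoutput>#count#</cfoutput>
```

If Google changes its markup, the pattern simply stops matching and the count falls back to 0, so a test like this against a saved copy of the page is a quick way to see whether the pattern still applies.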

I must give credit to the post “How to Scrape Pages With ColdFusion”. It gave me the basics I needed to get up and running.

Do you scrape data with ColdFusion? If so, do you use a similar method and do you use anything to help you with the regular expression syntax? Let me know, it would make for a good follow-up post.

8 Responses to “Scraping Google SERPs with ColdFusion”

  1. Edoc Says:

    Sorry for the off-topic post. Here’s a REBOL version. I bet you can follow the code without knowing the language, something that is often not possible with regular expressions.

    REBOL [Title: "Parse Google Pages Example"]

    goog: http://www.google.com/search?hl=en&btnG=Google+Search&q=
    domain: "jasonbartholme.com"
    query: rejoin [goog domain]

    parse/all read query [
    thru "Results" thru "of about " copy num to (print num)
    ]

    See rebol.com for more info on REBOL.

  2. Jason Bartholme Says:

    Hello Edoc,

    Honestly, I had never heard of REBOL, but the syntax looks very neat and clean. I will definitely be checking that out.

  3. Edoc Says:

    The code was slightly altered by the comments engine. The bold text

    copy num to

    should end with a closed “</b>” tag. I’ve put a full version of the code here.

  4. KC Says:

    Just a quick note - your code states “from” instead of “for”,

    Thanks for the sample

  5. Edoc Says:

    Thanks — I’m not clear on what should be changed, but I caught a few pesky issues. Check the code on the link again, and you’ll see I fixed the line with “copy num to”.

  6. Rebol Tutorial - Rebol for SEO/Marketing Guys Says:

    [...] Bartholme wrote on his blog: I have been working on an internal tool for our sales and marketing departments. They wanted the [...]

  7. reboltutorial Says:

    Discovered your article through fleetsweet so I made an article on yours :)
    http://reboltutorial.com/blog/rebol-for-seo-marketing-guys/

  8. Middayc Says:

    Hi,

    Saw your blogpost on reboltutorial page…

    “Recently, I have been working on an internal tool for our sales and marketing departments. They wanted the ability to provide the URL of a company and return various information. One of the elements they wanted was the number of pages index in Google.”

    This sounds exactly like the definition of http://www.site-assistant.com :) (that is also made with REBOL). If SA or some changes to it would help you, let me know.
