Recently, I have been working on an internal tool for our sales and marketing departments. They wanted the ability to provide the URL of a company and return various information. One of the elements they wanted was the number of pages indexed in Google.
This was my first attempt to combine regular expressions with cfhttp. It’s pretty handy once you get the hang of it, and it can be a very powerful combination. Here’s the script I use to scrape the total number of pages indexed in Google.
<!--- We only need the domain name without the http://www. --->
<cfset siteurl = "jasonbartholme.com">

<!--- Regular expression which matches the pattern to determine count --->
<cfset googleregex = '<font size=-1>Results [\s\S]*? of about <b>([\s\S]*?)</b> for'>

<!--- useragent is required because the page will not "get" properly otherwise --->
<cfhttp url="http://www.google.com/search?q=site%3A#siteurl#" method="get" resolveurl="false" useragent="#cgi.http_user_agent#">
</cfhttp>

<!--- Trims the whitespace in the content, and check for our regex pattern --->
<cfset sdoc = trim(cfhttp.filecontent)>
<cfset result = refindnocase(googleregex,sdoc,1,"true")>

<!--- cftry/cfcatch to see if refindnocase() returned a result --->
<cftry>
    <cfset resultcount = replace(mid(sdoc,result.pos[2],result.len[2]),',','','ALL')>
    <cfcatch type="any">
        <cfset resultcount = 0>
    </cfcatch>
</cftry>

<!--- display our result --->
Pages indexed: <cfoutput>#resultcount#</cfoutput>
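If you want to test the parsing step without calling Google every time, one option is to pull it out into its own function. The following is only a rough sketch: the function name extractIndexedCount and the sample markup are made up for illustration, and it swaps the cftry/cfcatch for an explicit check on the arrays that refindnocase() returns when subexpressions are requested.

```cfml
<!--- Hypothetical helper (not part of the original script): returns the count, or 0 if the pattern is not found --->
<cffunction name="extractIndexedCount" returntype="numeric" output="false">
    <cfargument name="html" type="string" required="true">

    <!--- Same pattern as the script; [\s\S]*? lazily matches any characters, including newlines --->
    <cfset var pattern = '<font size=-1>Results [\s\S]*? of about <b>([\s\S]*?)</b> for'>
    <cfset var match = reFindNoCase(pattern, arguments.html, 1, true)>
    <cfset var count = "">

    <!--- On no match the pos/len arrays hold a single 0, so there is no second element to read --->
    <cfif arrayLen(match.pos) lt 2 or match.pos[2] eq 0 or match.len[2] eq 0>
        <cfreturn 0>
    </cfif>

    <!--- pos[2]/len[2] describe the first capture group, e.g. "1,234"; strip the thousands separators --->
    <cfset count = replace(mid(arguments.html, match.pos[2], match.len[2]), ",", "", "all")>

    <!--- Guard against the capture group containing something other than a number --->
    <cfif isNumeric(count)>
        <cfreturn count>
    </cfif>
    <cfreturn 0>
</cffunction>

<!--- Made-up sample markup in the shape the pattern expects (an assumption, not real Google output) --->
<cfset sample = '<font size=-1>Results <b>1</b> - <b>10</b> of about <b>1,234</b> for <b>site:example.com</b></font>'>
Pages indexed: <cfoutput>#extractIndexedCount(sample)#</cfoutput>
```

With this in place, the live version would simply pass trim(cfhttp.filecontent) to the function instead of the sample string. If the markup of the results page changes, only the pattern inside the function needs to be updated.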
I must give credit to the post “How to Scrape Pages With ColdFusion”. It gave me the basics I needed to get up and running.
Do you scrape data with ColdFusion? If so, do you use a similar method, and do you use anything to help you with the regular expression syntax? Let me know; it would make for a good follow-up post.





July 28th, 2008 at 5:02 pm
Sorry for the off-topic post. Here’s a REBOL version. I bet you can follow the code without knowing the language, something that is often not possible with regular expressions.
REBOL [Title: "Parse Google Pages Example"]
goog: http://www.google.com/search?hl=en&btnG=Google+Search&q=
domain: "jasonbartholme.com"
query: rejoin [goog domain]
parse/all read query [
thru "Results" thru "of about " copy num to (print num)
]
See rebol.com for more info on REBOL.
July 29th, 2008 at 9:33 am
Hello Edoc,
Honestly, I had never heard of REBOL, but the syntax looks very neat and clean. I will definitely be checking that out.
July 29th, 2008 at 10:37 am
The code was slightly altered by the comments engine. The bold text
copy num to
should end with a closing “</b>” tag. I’ve put a full version of the code here.
August 21st, 2008 at 5:51 am
Just a quick note - your code states “from” instead of “for”,
Thanks for the sample
August 21st, 2008 at 1:15 pm
Thanks — I’m not clear on what should be changed, but I caught a few pesky issues. Check the code on the link again, and you’ll see I fixed the line with “copy num to”.
September 6th, 2009 at 6:23 am
Discovered your article through fleetsweet so I made an article on yours
http://reboltutorial.com/blog/rebol-for-seo-marketing-guys/
September 17th, 2009 at 2:36 am
Hi,
Saw your blogpost on reboltutorial page…
“Recently, I have been working on an internal tool for our sales and marketing departments. They wanted the ability to provide the URL of a company and return various information. One of the elements they wanted was the number of pages index in Google.”
This sounds exactly like the definition of http://www.site-assistant.com
(that is also made with REBOL). If SA, or some changes to it, would help you, let me know.