ColdFusion Histogram from a string of words
I've been doing some SEO work for <a href="http://www.greatdentalwebsites.com">Great Dental Websites</a> and needed to automatically generate some keywords and meta data.
To accomplish my particular task, I needed a histogram of all the words in a long blob of text. I was shocked not to find anything on this written in CF, so I set out to write my own:
<cfset faqText = getAllFAQs.question & " " & stripHTML(getAllFAQs.answer) />
<cfset skipwords = "all,another,any,anybody,anyone,anything,both,each,either,everybody,everyone,everything,few,he,her,hers,herself,him,himself,his,I,it,its,itself,little,many,me,mine,more,most,much,myself,neither,no,one,nobody,none,nothing,one,one another,other,others,ours,ourselves,several,she,some,somebody,someone,something,that,theirs,them,themselves,these,they,this,those,us,we,what,whatever,which,whichever,who,whoever,whom,whomever,whose,you,yours,yourself,yourselves,,a,the,to,are,of,can,is,but,have,that,want,What,my,an,for,all,out,and,look,very,need,get,case" />
------
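<!--- getHistogram returns an array, which cfoutput cannot print directly, so flatten it to a list first --->
<cfoutput>#arrayToList(getHistogram(faqText, skipwords, 10))#</cfoutput>
<cffunction name="getHistogram" returntype="array" hint="Creates a histogram of words">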
<cfargument name="sourceText" required="true" hint="The string of text we want to generate a histogram for" type="string" />
<cfargument name="ignoreList" required="false" hint="comma-delimited list of words to ignore" type="string" />
<cfargument name="histogramLength" required="false" hint="number of words we want to send back, e.g. only the top 5" type="string" />
<cfset var histogramCount = structNew() /> <!--- our histogram! --->
<cfset var sortedHistogram = "" /> <!--- a sorted array of our histogram --->
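<cfset var x = "" /> <!--- iterator --->
<cfset var i = "" /> <!--- iterator --->
<cfset var y = "" /> <!--- iterator used when trimming the sorted array --->
<cfset var useNum = 0 /> <!--- first array position to drop when trimming --->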
<!--- loop through all of the text, assuming that a space separates a word --->
<cfloop delimiters=" " list="#sourceText#" index="i">
<!--- see if we have this already in our struct --->
<cfif structKeyExists(histogramCount, i)>
<!--- we do! increase its count by 1 --->
<cfset histogramCount[i] = histogramCount[i] + 1 />
<cfelse>
<!--- we do not, make a new key in the struct for this word --->
<cfset histogramCount[i] = 1 />
</cfif>
</cfloop>
<!--- Do we have an ignore list? --->
<cfif structKeyExists(arguments, "ignoreList") and len(trim(arguments.ignoreList))>
<!--- loop over the list of ignore words and remove any matches from our structure --->
<cfloop delimiters="," list="#arguments.ignoreList#" index="x">
<!--- does this word occur in our struct? --->
<cfif structKeyExists(histogramCount, x)>
<!--- yes, so remove it --->
<cfset structDelete(histogramCount, x) />
</cfif>
</cfloop>
</cfif>
<!--- Sort the histogram based on most occurrences of a given word --->
<cfset sortedHistogram = StructSort(histogramCount, "numeric", "DESC") />
<!--- see if we need to only show x number of words for this histogram --->
<cfif structKeyExists(arguments, "histogramLength") and len(trim(arguments.histogramLength))>
<cfset useNum = arguments.histogramLength + 1 />
<cfloop index="y" from="#arrayLen(sortedHistogram)#" to="#useNum#" step="-1">
<cfset ArrayDeleteAt(sortedHistogram, y) />
</cfloop>
</cfif>
<cfreturn sortedHistogram />
</cffunction>
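One caveat with splitting on spaces alone: punctuation stays attached, so "teeth." and "teeth" end up as separate entries, and words separated only by a tab or line break are not split at all. A small pre-processing step along these lines might help. This is only a sketch: `normalizeForHistogram` is a made-up name, and the character class keeps just ASCII letters, digits and apostrophes, so accented characters would be stripped as well.

```cfml
<cffunction name="normalizeForHistogram" returntype="string" output="false" hint="Hypothetical helper: strips punctuation so words split cleanly on spaces">
    <cfargument name="sourceText" required="true" type="string" />
    <!--- replace every character that is not an ASCII letter, digit, apostrophe or space with a space;
          tabs and line breaks are replaced too, so each word ends up space-separated --->
    <cfreturn reReplace(arguments.sourceText, "[^A-Za-z0-9' ]", " ", "all") />
</cffunction>

<!--- usage: clean the text before building the histogram --->
<cfset faqText = normalizeForHistogram(faqText) />
```

The runs of spaces this leaves behind should not matter, since a list loop skips empty elements.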
One thing I would like to improve is for my function to return not only the list of words ordered by how often they came up, but also the number of times each one came up. That data gets dropped when we sort the struct, because StructSort returns only an array of the keys (which is weird).
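One possible way around this, as an untested sketch: the counts are still sitting in the struct after the sort, so each sorted key can be paired back up with its count. `getHistogramWithCounts` is a made-up name, and treating a `histogramLength` of 0 as "return everything" is my own assumption rather than how the function above behaves. The `cfloop array` in the usage example needs ColdFusion 8 or later.

```cfml
<cffunction name="getHistogramWithCounts" returntype="array" output="false" hint="Sketch: like getHistogram, but keeps each word's count">
    <cfargument name="sourceText" required="true" type="string" />
    <cfargument name="ignoreList" required="false" type="string" default="" />
    <cfargument name="histogramLength" required="false" type="numeric" default="0" hint="0 means return every word" />
    <cfset var counts = structNew() />
    <cfset var sortedKeys = "" />
    <cfset var result = arrayNew(1) />
    <cfset var entry = "" />
    <cfset var word = "" />
    <cfset var limit = 0 />
    <cfset var n = 0 />
    <!--- count each space-separated word --->
    <cfloop list="#arguments.sourceText#" delimiters=" " index="word">
        <cfif structKeyExists(counts, word)>
            <cfset counts[word] = counts[word] + 1 />
        <cfelse>
            <cfset counts[word] = 1 />
        </cfif>
    </cfloop>
    <!--- drop ignored words; structDelete does nothing when the key is missing --->
    <cfloop list="#arguments.ignoreList#" delimiters="," index="word">
        <cfset structDelete(counts, word) />
    </cfloop>
    <!--- keys ordered by count, highest first --->
    <cfset sortedKeys = structSort(counts, "numeric", "desc") />
    <!--- work out how many entries to return --->
    <cfset limit = arrayLen(sortedKeys) />
    <cfif arguments.histogramLength gt 0 and arguments.histogramLength lt limit>
        <cfset limit = arguments.histogramLength />
    </cfif>
    <!--- pair each sorted key with its count, looked up in the original struct --->
    <cfloop from="1" to="#limit#" index="n">
        <cfset entry = structNew() />
        <cfset entry.word = sortedKeys[n] />
        <cfset entry.count = counts[sortedKeys[n]] />
        <cfset arrayAppend(result, entry) />
    </cfloop>
    <cfreturn result />
</cffunction>

<!--- usage: print the top 10 words with their counts --->
<cfset topWords = getHistogramWithCounts(faqText, skipwords, 10) />
<cfoutput>
<cfloop array="#topWords#" index="item">#item.word#: #item.count#<br /></cfloop>
</cfoutput>
```

Returning an array of structs keeps the sort order and the counts together, at the cost of a second pass over the sorted keys.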
Any suggestions?



