ColdFusion Histogram from a string of words

coldfusion

I've been doing some SEO stuff for <a href="http://www.greatdentalwebsites.com">Great Dental Websites</a> and I had a need to automatically generate some keywords and meta data.

In order to accomplish my particular task, I need a histogram of all the words in a long blob of text.  I was shocked to not be able to find anything on this in written in CF, so I set out to write my own:

 


<cfset faqText = getAllFAQs.question & " " & stripHTML(getAllFAQs.answer) />
<cfset skipwords = "all,another,any,anybody,anyone,anything,both,each,either,everybody,everyone,everything,few,he,her,hers,herself,him,himself,his,I,it,its,itself,little,many,me,mine,more,most,much,myself,neither,no,one,nobody,none,nothing,one,one another,other,others,ours,ourselves,several,she,some,somebody,someone,something,that,theirs,them,themselves,these,they,this,those,us,we,what,whatever,which,whichever,who,whoever,whom,whomever,whose,you,yours,yourself,yourselves,,a,the,to,are,of,can,is,but,have,that,want,What,my,an,for,all,out,and,look,very,need,get,case" />
------
<cfoutput>#getHistogram(faqText,skipwords, 10)#</cfoutput>

<cffunction name="getHistogram" returntype="array" hint="Creats a histogram of words">
    <cfargument name="sourceText" required="true" hint="The string of text we want to generate a histogram for" type="string" />
    <cfargument name="ignoreList" required="false" hint="comma delineated list of words to ignore" type="string" />
    <cfargument name="histogramLength" required="false" hint="number of words that we want to send back..ie only the top 5" type="string" />

    <cfset var histogramCount = structNew() /> <!--- our histogram! --->
    <cfset var sortedHistogram = "" />  <!--- a sorted array of our histogram --->
    <cfset var x = "" /> <!--- iterator --->
    <cfset var i = "" /> <!--- iterator --->
   
    <!--- loop through all of the text, assuming that a space separates a word --->
    <cfloop delimiters=" " list="#sourceText#" index="i">
   
        <!--- see if we have this already in our struct --->
        <cfif structKeyExists(histogramCount, "#i#")>
            <!--- we do! increase its count by 1 --->
            <cfset histogramCount[i] = histogramCount[i] + 1 />
        <cfelse>
            <!--- we do not, make a new key in the struct for this word --->
            <cfset histogramCount[i] = 1 />
        </cfif>
    </cfloop>
   
   
    <!--- Do we have an ignore list? --->
    <cfif structKeyExists(arguments, "ignoreList") and len(trim(arguments.ignoreList))>
        <!--- loop over the list of ignore words and remove any matches from our structure --->
        <cfloop delimiters="," list="#arguments.ignoreList#" index="x">
            <!--- does this word occur in our struct? --->
            <cfif structKeyExists(histogramCount, x)>
                <!--- yes, so remove it --->
                <cfset structDelete(histogramCount, x) />
            </cfif>
        </cfloop>
    </cfif>

    <!--- Sort the histogram based on most occurences of a given word --->
    <cfset sortedHistogram = StructSort(histogramCount, "numeric", "DESC") />

    <!--- see if we need to only show x number of words for this histogram --->
    <cfif structKeyExists(arguments, "histogramLength") and len(trim(arguments.histogramLength))>
        <cfset useNum = arguments.histogramLength + 1 />
        <cfloop index="y" from="#arrayLen(sortedHistogram)#" to="#useNum#" step="-1">

            <cfset ArrayDeleteAt(sortedHistogram, y) />       
        </cfloop>
    </cfif>

    <cfreturn sortedHistogram>

</cffunction>

One thing I would like to improve is for my function to return not only the list of words in terms of how often they came up, but also the number of times they came up.  This data gets dropped when we sort the struct into an array (which is a weird).

Any suggestions?

 

Barney said:
 
Here are a pair of implementations. One with identical API, one with slightly different API that returns words and counts together (as a query).

http://barneyb.com/r/word_hist.cfm

Your implementation is also in there, at the bottom. Totally off-the-cuff code, so it's undoubtedly imperfect, but a starting point.
 
posted 195 days ago
View Replies (3) || Add Comment Reply to: this comment OR this thread
 
.: HIDE REPLIES :.
Jeff said:
 
Ahh - you cheated and dipped into the java layer, but very impressive none-the-less; you have out classed me. Could you briefly explain whats going on with the java object and the regular expresion. Are you replacing all of the non alpha characters with a space?
 
posted 195 days ago
Add Comment Reply to: this comment OR this thread
 
Jeff said:
 
ah, I see, removes non-alphas and lowercases everything...very nice
 
posted 195 days ago
Add Comment Reply to: this comment OR this thread
 
Barney said:
 
The HashSet is just a set of unique values that I can use to easily check for the words to ignore. You could use a struct to emulate it (keys are words, values are irrelevant).

You figured out the regex exactly. Your initial implementation returned separate values for "course," and "course." on my sample string. That should probably use \w (for word characters) instead of specifically calling out the letters, but whatever. The lowercasing is just to make everything consistent; in this case the words "A" and "a" are the same.
 
posted 194 days ago
Add Comment Reply to: this comment OR this thread
 
Jeff said:
 
@Barney: One defect of this is it cannot detect duplicates, or prefixes or suffixes.

So headache != headaches
and test != testing

Any suggestions for how to improve this? Off the top of my head:
1) Go through the entire list of words and check for an s (and other common endings). This should catch most, but also give a few false positives.

2) Try to do some keyword matching, ie if you found "test", you would get a count for ANY words that contained test, like tests, testers, or testing. It would also, however, product false positives for "contest" or "attest"

Any other ideas?
 
posted 193 days ago
View Replies (1) || Add Comment Reply to: this comment OR this thread
 
.: HIDE REPLIES :.
Barney said:
 
Stemming is a huge PITA, especially with English where the rules of the language are completely insane. If you really need that sort of behaviour, rolling your own is probably not your best choice. After building the histogram, running a search and finding all pairs that differ only by a trailing 's' will get you closer, and I don't think it'll give you too many false positives, but that's an expensive search (Cartesian product), and the results are still going to be a ways off, so hard to say if it's worth it.
 
posted 193 days ago
Add Comment Reply to: this comment OR this thread
 

Search