
CPCUG Monitor

Creating dynamic websites with ColdFusion

Part 5: Verity Free Text Search

by Michael Smith, TeraTech http://www.teratech.com/

What is ColdFusion?

In this article we continue to look at what ColdFusion is and how you can use it for dynamic website creation. We cover free text searching of both multiple files and databases using the ColdFusion Verity text search engine. Free text searching lets you look for words anywhere in a directory structure or database.

 

In case you missed the previous articles that introduced ColdFusion, let me explain what it is. ColdFusion is a programming language based on standard HTML (HyperText Markup Language) that is used to write dynamic webpages. It lets you create pages on the fly that differ depending on user input, database lookups, time of day or whatever other criteria you dream up! ColdFusion pages consist of standard HTML tags such as <FONT SIZE="+2"> together with CFML (ColdFusion Markup Language) tags such as <CFSEARCH>, <CFIF> and <CFLOOP>. ColdFusion was introduced by Allaire in 1995 and is currently on version 4.

Text Searching in ColdFusion

Free text searching is a very powerful programming tool that lets you search thousands of files or database records for any text anywhere within them. ColdFusion implements text searching with the Verity engine, using the <CFSEARCH> and <CFINDEX> tags. The search language allows for:

·         Wildcards - regular expression style use of ?, *, [], -, ^

·         Evidence operators - STEM, WILDCARD, WORD

·         Proximity operators - NEAR, PARAGRAPH, PHRASE, SENTENCE

·         Relational operators - CONTAINS, MATCHES, STARTS, ENDS, SUBSTRING

·         Concept operators - AND, OR, ACCRUE

·         Score operators - YESNO, PRODUCT, SUM, COMPLEMENT

 

You could write hundreds of lines of code to do these kinds of searches yourself, but it would run orders of magnitude slower than using the single <CFSEARCH> tag. This is because Verity creates a word lookup index (or collection) of every piece of text in your files or records, so that it can go straight to the ones you are searching for. This is analogous to the index in a book, which lists all the pages that a certain word appears on. If you imagine how tedious it would be to search for words in a book without an index, you will have an idea of the advantages Verity can give your ColdFusion programs!
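To make the book-index analogy concrete, here is a toy word index built in plain CFML. It is only a sketch of the general idea and says nothing about how Verity is implemented internally. The three sample "pages" and all variable names are invented for the illustration.

```cfml
<!--- Three tiny sample "pages", keyed by page number (made-up data). --->
<CFSET pages = StructNew()>
<CFSET temp = StructInsert(pages, "1", "verity indexes every word")>
<CFSET temp = StructInsert(pages, "2", "every search uses the index")>
<CFSET temp = StructInsert(pages, "3", "coldfusion calls verity")>

<!--- Build the index: each word maps to a list of the pages it appears on. --->
<CFSET wordIndex = StructNew()>
<CFLOOP COLLECTION="#pages#" ITEM="pageNo">
   <CFLOOP INDEX="word" LIST="#StructFind(pages, pageNo)#" DELIMITERS=" ">
      <CFIF NOT StructKeyExists(wordIndex, word)>
         <!--- First time we see this word: start its page list. --->
         <CFSET temp = StructInsert(wordIndex, word, pageNo)>
      <CFELSEIF NOT ListFind(StructFind(wordIndex, word), pageNo)>
         <!--- Word seen before, but not on this page: extend its page list. --->
         <CFSET temp = StructUpdate(wordIndex, word,
                  ListAppend(StructFind(wordIndex, word), pageNo))>
      </CFIF>
   </CFLOOP>
</CFLOOP>

<!--- A single lookup now replaces a scan of every page. --->
<CFOUTPUT>
   "verity" appears on pages: #StructFind(wordIndex, "verity")#
</CFOUTPUT>
```

The work of reading every page is done once, when the index is built. Each later search is a single lookup, and that is where the speed advantage comes from.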

The Verity Engine

The free text indexing and searching functionality in ColdFusion is based on Verity, Inc.’s SEARCH’97 product. Indexing data is available both through the <CFINDEX> tag and the ColdFusion Administrator, where you can create and manage collections. Searching is done using the <CFSEARCH> tag. Output of search results to your pages is done using the same <CFOUTPUT> tag that you would use with database queries.

 

The Verity engine performs searches against collections. Collections consist of an index of all the words in all the files or records you want to search. Collection information includes:

·         Word indexes

·         An internal documents table

·         Logical pointers to actual document files

 

In your ColdFusion application, you can populate and search multiple collections, each of which can be designed to focus on a specific group of documents or queries, according to subject, document type, location, or any other logical grouping. Searches can be performed against multiple collections, giving you lots of flexibility in designing your search interface.
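As a sketch of searching several collections at once, <CFSEARCH> accepts a comma-separated list of collection names in its COLLECTION attribute. The example assumes that two collections named KnowledgeBase and Messages already exist and have been populated. The search word is only a placeholder.

```cfml
<!--- Search two collections in one pass (both are assumed to exist already). --->
<CFSEARCH
      COLLECTION="KnowledgeBase,Messages"
      NAME="Combined"
      TYPE="SIMPLE"
      CRITERIA="verity">

<!--- Matches from both collections come back in the single query "Combined". --->
<CFOUTPUT QUERY="Combined">
   #Score# - #Key#<BR>
</CFOUTPUT>
```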

 

The <CFINDEX> tag lets you manage the data in an existing collection, including:

·         Indexing text or binary data in specified directories, or indexing ColdFusion queries.

·         Purging a collection of data.

·         Updating, refreshing, and optimizing a collection.
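The maintenance operations in this list map onto the ACTION attribute of <CFINDEX>. The following is a minimal sketch, assuming a collection named KnowledgeBase that already exists. The directory path and the extension list are example values.

```cfml
<!--- Empty the collection but keep the collection itself. --->
<CFINDEX COLLECTION="KnowledgeBase" ACTION="PURGE">

<!--- Add new or changed documents without clearing what is already indexed. --->
<CFINDEX COLLECTION="KnowledgeBase"
   ACTION="UPDATE"
   TYPE="PATH"
   KEY="X:\knowledgebase"
   EXTENSIONS=".htm, .txt"
   RECURSE="Yes">

<!--- Tidy up the index files after many updates so that searches stay fast. --->
<CFINDEX COLLECTION="KnowledgeBase" ACTION="OPTIMIZE">
```

ACTION="REFRESH" differs from UPDATE in that it clears the collection before repopulating it.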


Creating a Verity Collection

Before you can perform any of these operations using <CFINDEX>, you need to create the collection in the ColdFusion Administrator. This is similar to the way you have to create a datasource in the Administrator before you can run SQL queries. Here are the steps for creating a collection:

1.        Open the ColdFusion Administrator Verity page.

2.        Enter a name for your collection. The Administrator fills in the Collection Root path with a corresponding directory path.

3.        Click Create. The new collection name and path appear in the Verity Collections list.

 

Figure 1. Creating a Verity Collection in the ColdFusion Administrator.

 

Once your collection is created, you can use either the Administrator or the <CFINDEX> tag to populate it with documents to search. Generally I use the Administrator for static data and the <CFINDEX> tag for data that changes and must be re-indexed frequently.

 

 

Here are some ideas on using Verity in your applications:
·	Index your Web site and provide a generalized search mechanism, such as a form interface, for executing searches. 
·	Index specific directories containing ASCII documents for subject-based searching. 
·	Index ColdFusion queries, giving your end users the ability to perform custom queries against data you’ve indexed. Since collections are made up of data optimized for retrieval, this method is much faster than performing multiple database queries to return the same data. 
·	Manage and search collections generated outside of ColdFusion using native Verity tools. This additional capability requires only that the full path to the collection be specified in the index command. 
·	Index email generated by ColdFusion application pages and create a searching mechanism for the indexed messages. 
·	Build collections of inventory data and make those collections available for searching from your ColdFusion application pages. 
·	Support international users in a range of languages from both the <CFINDEX> and <CFSEARCH> tags. 

Indexing documents

ColdFusion allows you to index and search collections populated with data from:

·         ASCII text files.

·         Binary Office documents (see below for details about document types).

·         ColdFusion queries resulting from data returned by a <CFQUERY> operation.

You can index libraries of HTML and CFML documents and other ASCII text files. Choose specific documents or an entire directory tree as the target of your collection. Collections can be stored anywhere, so you have a lot of flexibility in accessing indexed data. This adds enormous value to any content-rich Web site.

 

For example, at TeraTech we are always coming across useful emails, documents, code snippets, web pages and newsgroup references. We never knew how to store these effectively for future reference. Paper printouts were hard to search and share in a team, and our existing computer copies were not much better! So we came up with a simple knowledge base by creating a straightforward directory-based system that can be searched by Verity. (It also has the added advantage of being very easy to save documents to.) If you make it too hard to save documents for reference, there will be no documents to search (it’s useless if no one uses it)! This is why we prefer saving the text documents to a simple directory system, instead of trying to be sophisticated and saving them in a database.

 

Whenever we find a document with some reference value, whether in email, in a newsgroup or on the web, we save it to the knowledge base directory on our shared X: drive. It is useful to give the file a long, descriptive name, since this will basically be the title of the document when search results are returned. We have found that Eudora conveniently saves email messages with a file name based on the subject of the message!

 

The ColdFusion code to populate the Verity collection for our knowledge base of documents is:

<CFINDEX

      ACTION="REFRESH"

      COLLECTION="KnowledgeBase"

      KEY="X:\knowledgebase"

      TYPE="PATH"

      EXTENSIONS=".htm, .cfm, .dbm, .txt, .htm*, .doc, .rtf, .pdf, *."

      RECURSE="Yes">

 

Here we are refreshing a collection named KnowledgeBase from the documents stored in the directory X:\knowledgebase\. The RECURSE parameter tells Verity to index all subdirectories too. The EXTENSIONS parameter lists the file types to index.

 

Note: if X: is not a physical drive on the ColdFusion server, you may have to refer to it by a UNC (Universal Naming Convention) path such as \\mswebserver\x-drive. This is because by default the ColdFusion process runs without logging into the machine, and so it doesn’t see mapped drive letters such as X:.

 

The knowledge base directory is broken down into common developer’s areas of interest, such as JavaScript, ColdFusion, ASP, Access97, VB, HTML, etc.  New directories can be added as needed.  The directories are not really necessary as far as Verity is concerned, but are useful to prevent information scramble/overload (and in case we ever want to do any clean up of the data).

 

For many documents the <CFINDEX> tag can take some time to run (on our site it takes an average of 45 seconds per 1000 documents). To avoid user delays and still keep the collection up to date as new documents are saved, we use the ColdFusion scheduler to run the refresh action automatically at 6am every day. A <CFMAIL> tag emails me to confirm that the command ran OK. The scheduled page, which refers to the drive by its UNC path, is listed after the supported document types below.

Document types supported
Verity supports a wide array of binary document types. This means you can index word processing, spreadsheet, and other document types and produce search results that include summaries of these documents. The following document types are supported:
·	ASCII text
·	Adobe Acrobat PDF
·	Ami Pro
·	WordPerfect
·	Word, RTF
·	Excel
·	PowerPoint

Verity also supports foreign language indexing using the ColdFusion International Language Search Pack in: German, French, Danish, Dutch, Finnish, Italian, Norwegian, Portuguese, Spanish and Swedish.

<CFSET starttime=now()>

<CFINDEX ACTION="REFRESH"

   COLLECTION="KnowledgeBase"

   KEY="\\mswebserver\x-drive\knowledgebase"

   TYPE="PATH"

   EXTENSIONS=".htm, .cfm, .dbm, .txt, .htm*, .doc, .rtf, .pdf, *."

   RECURSE="Yes">

<CFMAIL

   TO="[email protected]" FROM="[email protected]"

   SUBJECT="Knowledgebase refresh"

   SMTPSERVER="smtp.teratech.com">

Knowledge base successfully refreshed

Time taken: #DateDiff('s', starttime, now())# seconds.

</cfmail>
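The daily 6am run can be set up on the scheduler page of the ColdFusion Administrator. As an alternative sketch, the <CFSCHEDULE> tag can register the same task from code. The task name, the URL of the refresh page and the start date below are all assumptions made for the example. The URL should point at wherever you saved the page containing the <CFINDEX> and <CFMAIL> code above.

```cfml
<!--- Run this once (for example from a setup page) to register the task. --->
<!--- TASK, URL and STARTDATE are hypothetical example values.            --->
<CFSCHEDULE
   ACTION="UPDATE"
   TASK="RefreshKnowledgeBase"
   OPERATION="HTTPRequest"
   URL="http://127.0.0.1/admin/refreshkb.cfm"
   STARTDATE="1/1/2000"
   STARTTIME="6:00 AM"
   INTERVAL="Daily"
   REQUESTTIMEOUT="600">
```

The generous REQUESTTIMEOUT (in seconds) gives the indexing page room to finish on a large document set.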

Indexing queries

In addition to indexing documents, Verity can index the output of a <CFQUERY>. Of course you could search the same data in SQL using the LIKE operator or the INSTR() function, but both of these methods use full table scans and so are slow on any but the smallest databases. Another advantage of Verity is that the search interface is simple both for the user and for you coding it, as typically you have one input field that is searched against all the indexed fields in the database.

 

To index a ColdFusion query:

1.     Define a logical name and location for your collection using the ColdFusion Administrator Verity page.

2.     Execute a <CFQUERY> to retrieve data from the desired ODBC data source.

3.     Generate the collection using the <CFINDEX> tag.

 

The query set is indexed using the <CFINDEX> tag in which you specify a KEY, typically a unique value like the primary key, and the column in which you want to conduct searches, the BODY. In our example we have a database of email messages to query from.

 

<CFQUERY NAME="Messages" DATASOURCE="TestDatasource">

   SELECT Message_ID , Body, UserName

   FROM Messages

</CFQUERY>

<CFINDEX COLLECTION="Messages"

   ACTION="UPDATE"

   TYPE="CUSTOM" 

   BODY="Body" 

   KEY="Message_ID" 

   TITLE="UserName"

   QUERY="Messages">

 

This <CFINDEX> statement specifies the Body column as the core of the collection and names the KEY as the Message_ID column, the table's primary key. Note that the TITLE attribute names the UserName column from the Messages table. The TITLE attribute can be used to designate an output parameter when you are displaying your Verity search results.

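<!--- SearchOutput is the NAME of a <CFSEARCH> run against this collection. --->

<!--- The indexed Message_ID value comes back in the KEY result column. --->

<CFOUTPUT QUERY="SearchOutput">

   Message number #KEY# was written by

     #TITLE#.

</CFOUTPUT>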

We will explain in detail how to search the collection below.
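As a preview, here is a sketch of a search against the Messages collection indexed above. The search word "budget" is only an example. The follow-up query assumes that Message_ID is a numeric column, so the keys can be dropped straight into an IN list.

```cfml
<!--- Search the query-based collection; "budget" is an example search word. --->
<CFSEARCH
   COLLECTION="Messages"
   NAME="SearchOutput"
   TYPE="SIMPLE"
   CRITERIA="budget">

<!--- KEY holds the Message_ID value and TITLE holds the UserName value. --->
<CFOUTPUT QUERY="SearchOutput">
   Message number #Key# was written by #Title# (score #Score#).<BR>
</CFOUTPUT>

<!--- Optionally fetch the full rows for the matches (numeric keys assumed). --->
<CFIF SearchOutput.RecordCount GT 0>
   <CFQUERY NAME="Matches" DATASOURCE="TestDatasource">
      SELECT Message_ID, Body, UserName
      FROM Messages
      WHERE Message_ID IN (#ValueList(SearchOutput.Key)#)
   </CFQUERY>
</CFIF>
```

Verity finds the matching keys quickly, and the database is then asked only for those rows, so no full table scan is needed.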

 

To index more than one column in a collection, enter a comma-separated list of column names for values of the BODY attribute, such as:

BODY="FirstName,LastName,Company"

As an alternative, you can use the concatenation function of your DBMS in a SELECT statement, such as:

SELECT FIRSTNAME+' '+LASTNAME AS WHOLENAME

A space is inserted between each concatenated value to avoid mixing up words. You would then generate a collection from WHOLENAME.
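Put together, the concatenation approach might look like the sketch below. The Contacts table, its Contact_ID key and the People collection are hypothetical names chosen for the example. The + operator is the string concatenation syntax of databases such as Access and SQL Server, and other DBMSs use a different operator or function.

```cfml
<!--- Contacts, Contact_ID and the People collection are hypothetical names. --->
<CFQUERY NAME="People" DATASOURCE="TestDatasource">
   SELECT Contact_ID, FirstName + ' ' + LastName AS WholeName
   FROM Contacts
</CFQUERY>

<!--- Index the single concatenated column as the searchable body. --->
<CFINDEX COLLECTION="People"
   ACTION="UPDATE"
   TYPE="CUSTOM"
   BODY="WholeName"
   KEY="Contact_ID"
   TITLE="WholeName"
   QUERY="People">
```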

 

Searching a Verity collection

The <CFSEARCH> tag lets you search one or more Verity collections. Searches can be for single words, for multiple words, or for complex expressions that use proximity operators, such as two words within three words of each other or in the same sentence.

 

In our file-based knowledge base example, the search is:

<CFSEARCH

      COLLECTION="KnowledgeBase"

      NAME="Articles"

      TYPE="SIMPLE"

      CRITERIA="#URL.SearchText#">

 

Here we are searching the collection called KnowledgeBase with a simple word search for words contained in the URL parameter SearchText. This parameter has been passed on the URL string to our search results page. The list of files matching the search is returned in the query named Articles.
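For completeness, here is one way the SearchText parameter could reach the results page. This is a sketch rather than our production code, and the page names search.cfm and results.cfm are assumptions. The form uses METHOD="GET" so that the field arrives in the URL scope, which is what the <CFSEARCH> above expects.

```cfml
<!--- search.cfm: METHOD="GET" puts SearchText on the URL of the results page. --->
<FORM ACTION="results.cfm" METHOD="GET">
   Search for: <INPUT TYPE="text" NAME="SearchText" SIZE="40">
   <INPUT TYPE="submit" VALUE="Search">
</FORM>

<!--- results.cfm: default to an empty string if the page is opened directly. --->
<CFPARAM NAME="URL.SearchText" DEFAULT="">

<CFIF Trim(URL.SearchText) IS "">
   Please enter one or more words to search for.
<CFELSE>
   <CFSEARCH
         COLLECTION="KnowledgeBase"
         NAME="Articles"
         TYPE="SIMPLE"
         CRITERIA="#URL.SearchText#">
   <CFOUTPUT>#Articles.RecordCount# documents found.</CFOUTPUT>
</CFIF>
```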

 

To display the search results a pageful at a time we use the <CFOUTPUT> tag with the startrow and maxrows parameters. These would be set using paging buttons on the results page, which to save space we have not shown here. We use a table format to make the display easier to read.

 

<TABLE BORDER="0" CELLPADDING="2" CELLSPACING="2">

<TR>

<TD><B>Score</B></TD><TD><B>Summary</B></TD>

</TR>

<CFOUTPUT QUERY="Articles" STARTROW=#StartAt# MAXROWS=#stepsize#>

<TR>

<TD WIDTH="30%" VALIGN="TOP">#score#</td>

<TD WIDTH="70%" VALIGN="TOP">

<A HREF="/knowledgebase/#URLEncodedFormat(url)#/#Replace(url, ' ', '','ALL')#" TARGET="_new">

<B>#Replace(key, "\\mswebserver\x-drive\knowledgebase\", '','ALL')#</B></A>

<BR>#HTMLEditFormat(Summary)#&nbsp;

</TD>

</TR>

</CFOUTPUT>

</TABLE>

 

In the output we use the standard <CFSEARCH> output columns score, url, key and summary (see below). We use the URLEncodedFormat function in case the file name contains spaces. We also add the file name to the end of the URL a second time with the spaces stripped, so that if the file is downloaded it will be saved with the stripped name. For example, “My Test.doc” would have the URL My%20Test%2Edoc/MyTest.doc, and if you clicked on the link the file would be saved as MyTest.doc. The target="_new" parameter of the HTML <A HREF> tag tells the browser to use a new window when you click on the link. We use the HTMLEditFormat function on the summary variable because any HTML it contains could garble our display. The function converts the HTML codes to displayable text.
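The paging logic that was left out to save space could look something like this sketch. The page name results.cfm, the page size of 10 and the helper variables PrevStart, NextStart and Encoded are assumptions made for the example. It also assumes that SearchText is not empty.

```cfml
<!--- StartAt arrives on the URL; fall back to the first row if it is missing. --->
<CFPARAM NAME="URL.SearchText" DEFAULT="">
<CFPARAM NAME="URL.StartAt" DEFAULT="1">
<CFSET stepsize = 10>
<CFSET StartAt = Max(Val(URL.StartAt), 1)>

<CFSEARCH COLLECTION="KnowledgeBase" NAME="Articles"
   TYPE="SIMPLE" CRITERIA="#URL.SearchText#">

<!--- The results table shown earlier goes here. --->

<!--- Work out where the previous and next pages start. --->
<CFSET PrevStart = Max(StartAt - stepsize, 1)>
<CFSET NextStart = StartAt + stepsize>
<CFSET Encoded = URLEncodedFormat(URL.SearchText)>

<CFOUTPUT>
<CFIF StartAt GT 1>
   <A HREF="results.cfm?SearchText=#Encoded#&StartAt=#PrevStart#">Previous</A>
</CFIF>
<CFIF NextStart LTE Articles.RecordCount>
   <A HREF="results.cfm?SearchText=#Encoded#&StartAt=#NextStart#">Next</A>
</CFIF>
</CFOUTPUT>
```

Each link passes the search text along again, so every page simply re-runs the same search and shows a different slice of the results.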

 

The full list of Verity result columns and variables is:

·         KEY — the value of the KEY attribute defined in the CFINDEX tag used to populate the collection. In our case the filename and path.

·         TITLE — Returns the value of the TITLE attribute defined by the <TITLE> HTML tag in any HTML or ColdFusion application page file that was indexed by CFINDEX. If the collection was TYPE=CUSTOM, TITLE returns the value of the TITLE attribute defined by the CFINDEX tag. If the collection was TYPE=FILE, TITLE also returns the value of the TITLE attribute defined by the CFINDEX tag.

·         SCORE — Returns the relevancy score of the document based on the search criteria from 0 to 100.

·         URL — Returns the value of the URLPATH attribute defined in the CFINDEX tag used to populate the collection.

·         SUMMARY - the best three sentences or 500 characters of documents returned by a search.

·         CUSTOM1, CUSTOM2 - user defined key fields

·         RECORDCOUNT — The total number of records returned by the query

·         CURRENTROW — The current row of the query being processed by CFOUTPUT

·         RECORDSSEARCHED — The total number of records in the index that were searched.

 

Figure 2. Verity search results page.

 

Verity Search Query Language

You can do more than just search for single words using the <CFSEARCH> CRITERIA parameter. You can also enter comma-delimited strings and use wildcard characters (regular expressions). By default, a simple query searches for words, not strings. For example, entering the word "all" will find documents containing the word "all" but not "allegorical." You can use wildcards, however, to broaden the scope of the search: "all*" will return documents containing "all," "alliterate" or any other word that starts with "all." Case is ignored, but only when (as above) the search string is all lowercase or all uppercase. If the criteria are mixed case ("All"), only the same case matches (only "All", not "all" or "ALL").

 

You can enter multiple words separated by commas: software, Microsoft, Oracle. The comma in a Simple query expression is treated like a logical OR. If you omit the commas, the query expression is treated as a phrase, so documents would be searched for the phrase "software Microsoft Oracle."

 

You can use the AND, OR, and NOT operators in a simple query: software AND (Microsoft OR Oracle). To search for an operator as an ordinary word, surround it with double quotation marks: software "and" Microsoft. This expression searches for the phrase "software and Microsoft."

 

A simple query employs the STEM operator and the MANY modifier. STEM searches for words that derive from those entered in the query expression, so that entering "find" will return documents that contain "find," "finding," "finds," etc. The MANY modifier forces the documents returned in the search to be presented in a list based on a relevancy score.
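The query forms described in this section can be tried directly by hard-coding the CRITERIA attribute. This is a sketch against our KnowledgeBase collection, and the result names are arbitrary. Note the single quotes around the last CRITERIA value, which let the expression itself contain double quotes.

```cfml
<!--- The comma acts as OR: documents containing any of the three words. --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="AnyWord" TYPE="SIMPLE"
   CRITERIA="software, Microsoft, Oracle">

<!--- Boolean operators grouped with parentheses. --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="Grouped" TYPE="SIMPLE"
   CRITERIA="software AND (Microsoft OR Oracle)">

<!--- Wildcard: any word beginning with "all". --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="Prefix" TYPE="SIMPLE"
   CRITERIA="all*">

<!--- A quoted operator is treated as an ordinary word in a phrase. --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="Phrase" TYPE="SIMPLE"
   CRITERIA='software "and" Microsoft'>

<CFOUTPUT>
   Any word: #AnyWord.RecordCount#<BR>
   Grouped: #Grouped.RecordCount#<BR>
   Prefix: #Prefix.RecordCount#<BR>
   Phrase: #Phrase.RecordCount#
</CFOUTPUT>
```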

 

For a full list of Verity operators, see the on-line help on our knowledge base page at http://www.teratech.com/knowledgebase/. You can also try out our Verity knowledge base there!

Summary

In this article we learned how to index both documents and large database queries for free text searches using Verity. We used the CFINDEX and CFSEARCH tags together with a CFOUTPUT to display the results.

To Learn More

You can download a free 30-day evaluation version of ColdFusion from Allaire or request a free evaluation CD-ROM from the Allaire website http://www.allaire.com/

 

Allaire Corporation

1 Alewife Center

Cambridge, MA 02140

 

Tel: 617.761.2000 voice

Fax: 617.761.2001 fax

Toll Free: 888.939.2545

Email: [email protected]

Web: www.allaire.com

 

ColdFusion Resources

Allaire also maintains an extensive knowledge base and tech support forums on their website.

CPCUG and TeraTech ColdFusion Conference http://www.cfconf.org/

TeraTech maintains a collection of ColdFusion code cuttings called ColdCuts at http://www.teratech.com/ColdCuts/. This page also has links to about a dozen ColdFusion white papers in the CF Info Center.

The Maryland ColdFusion User Group meets the second Tuesday of each month at Backstreets Cafe, 12352 Wilkins Avenue, Rockville. See http://www.cfug-md.org/ for details and directions.

The DC ColdFusion User Group meets the first Wednesday of each month at Figleaf, 16th and P St NW, Washington DC. See the DCCFUG page on http://www.figleaf.com/ for details and directions.

Bio

Michael Smith is president of TeraTech, a ten-year-old consulting company based in Rockville, Maryland, that specializes in ColdFusion, database and Visual Basic development. You can reach Michael at [email protected] or 301-424-3903.



Copyright © 1997-2024, Maryland Cold Fusion User Group. All rights reserved.