Umbraco Examine v4.x - Powerful Umbraco Indexing

by Shannon Deminick 20. April 2009 17:55

This post is outdated. For the latest information on Examine, please refer to either the Examine page on our site or the Examine CodePlex project home.

Umbraco Examine is a powerful, fully configurable, and extensible library for indexing Umbraco content to allow fast and easy content searching. It uses the Lucene.Net library which is included in the Umbraco installation (v2.x). It is easy to set up and caters for everything from simple indexing and searching to very complex scenarios through its fully extensible codebase and its event model. The library was built with .Net 3.5 SP1 and has not been tested with previous versions of .Net.

Basic Setup

  • Copy the DLL files to the bin folder
  • Add the following to the <configSections> portion of your Web.config file:
<section name="UmbLuceneIndex" 
type="TheFarm.Umbraco.Lucene.Configuration.IndexSets, TheFarm.Umbraco.Lucene" />
  • For the most basic setup, add the following to the configuration in your Web.config (Also see the readme.txt and app.config files in the binaries download!):
<UmbLuceneIndex DefaultIndexSet="MyIndexSet" EnableDefaultActionHandler="true">
<IndexSet SetName="MyIndexSet" IndexPath="~/data/UmbracoExamine/" MaxResults="100">
<IndexUmbracoFields>
<add Name="id" /> <!-- REQUIRED -->
<add Name="nodeName" /> <!-- REQUIRED -->
<add Name="updateDate" />
<add Name="writerName" />
<add Name="path" />
<add Name="nodeTypeAlias" /> <!-- REQUIRED -->
</IndexUmbracoFields>
<IndexUserFields>
<add Name="PageTitle"/>
<add Name="PageContent"/>
</IndexUserFields>
<IncludeNodeTypes />
<ExcludeNodeTypes />
</IndexSet>
</UmbLuceneIndex>
  • Create the folder: ~/data/UmbracoExamine/ since this is what has been specified for the index path above. 
    • Ensure that the IIS user has full control on this folder.
  • Since EnableDefaultActionHandler is set to true, each time a node is published, it will be indexed based on the rules supplied in the configuration. When a node is unpublished, it will automatically be removed from the index.
  • Log into Umbraco, publish a node and verify that files have been created in the index path as specified above.

Basic Search

  • To perform a search:
UmbracoIndexer examine = new UmbracoIndexer();
List<SearchResult> results = examine.Search("find this", true);
  • The returned structure is simple, containing 3 properties: Id, Score and Fields:
public int Id { get; set; }
public float Score { get; set; }
public Dictionary<string, string> Fields { get; set; }
  • The Fields property contains all of the field data that has been configured in the web.config file.
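
As a rough illustration of consuming these results, the sketch below turns a list of results into display lines. It assumes only the three properties listed above. The SearchResult class here is a local stand-in so the snippet compiles on its own, and SearchResultFormatter.Describe is a hypothetical helper, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in mirroring the three properties listed above (an assumption for
// this sketch); in a real site use the SearchResult type from the library.
public class SearchResult
{
    public int Id { get; set; }
    public float Score { get; set; }
    public Dictionary<string, string> Fields { get; set; }
}

// Hypothetical helper, not part of Umbraco Examine.
public static class SearchResultFormatter
{
    // Returns one "id: nodeName" line per result, highest score first.
    public static List<string> Describe(IEnumerable<SearchResult> results)
    {
        var lines = new List<string>();
        foreach (SearchResult result in results.OrderByDescending(r => r.Score))
        {
            // A field is only available if it was listed in the index set
            // configuration, so look it up defensively.
            string nodeName;
            if (!result.Fields.TryGetValue("nodeName", out nodeName))
            {
                nodeName = "(nodeName not indexed)";
            }
            lines.Add(string.Format("{0}: {1}", result.Id, nodeName));
        }
        return lines;
    }
}
```

With the library referenced, the list returned by examine.Search("find this", true) could be passed to Describe in place of hand-built data. The explicit sort by Score is a choice made for this sketch; the order in which the library returns results is not stated here.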

Advanced Setup

You can create multiple indexes depending on your needs. For example, you may want different indexes for different portal sites in your content tree, or separate indexes for different types of content, such as one for News and one for Forum. Creating different indexes is easy:

<UmbLuceneIndex DefaultIndexSet="Site1" EnableDefaultActionHandler="true">
  <!-- Create an index for a site called 'Site1' which has a starting parent
       node in the content tree of 1234. Only nodes that have the Id, or are children of
       node 1234 will be indexed. -->
  <IndexSet SetName="Site1" IndexPath="~/data/indexes/site1/" MaxResults="100"
            IndexParentId="1234">
    <IndexUmbracoFields>
      <add Name="id" /> <!-- REQUIRED -->
      <add Name="nodeName" /> <!-- REQUIRED -->
      <add Name="updateDate" />
      <add Name="writerName" />
      <add Name="path" />
      <add Name="nodeTypeAlias" /> <!-- REQUIRED -->
      <add Name="parentID"/>
    </IndexUmbracoFields>
    <IndexUserFields>
      <add Name="PageTitle"/>
      <add Name="PageContent"/>
      <add Name="CommentText"/>
      <add Name="CommentUser"/>
      <add Name="umbracoNaviHide"/>
    </IndexUserFields>
    <IncludeNodeTypes>
      <add Name="HomePage" />
      <add Name="BasicPage" />
      <add Name="Comment" />
    </IncludeNodeTypes>
    <ExcludeNodeTypes />
  </IndexSet>
  <!-- Create an index for a site called 'Site2' which has a starting parent node in
       the content tree of 4567. Only nodes that have the Id, or are children of node 4567
       will be indexed. -->
  <IndexSet SetName="Site2" IndexPath="~/data/indexes/site2/" MaxResults="100"
            IndexParentId="4567">
    <IndexUmbracoFields>
      <add Name="id" /> <!-- REQUIRED -->
      <add Name="nodeName" /> <!-- REQUIRED -->
      <add Name="updateDate" />
      <add Name="writerName" />
      <add Name="path" />
      <add Name="nodeTypeAlias" /> <!-- REQUIRED -->
    </IndexUmbracoFields>
    <IndexUserFields>
      <add Name="PageTitle"/>
      <add Name="PageContent"/>
      <add Name="umbracoNaviHide"/>
      <!-- You can add as many user fields here that you would like to be indexed... -->
    </IndexUserFields>
    <IncludeNodeTypes />
    <ExcludeNodeTypes>
      <!-- Index everything except for document types of 'UserNotes' -->
      <add Name="UserNotes" />
    </ExcludeNodeTypes>
  </IndexSet>
</UmbLuceneIndex>

Advanced Search

There are a few overloaded Search methods you can use to perform different types of searches, depending on the kind of results you want to achieve:

//This will create a new examiner to search in Site1 since Site1 is
//listed as the default Index in the configuration.
UmbracoIndexer examine = new UmbracoIndexer();
List<SearchResult> results = examine.Search("find this in Site1", true);

//This will create a new examiner to search in Site2
UmbracoIndexer examine2 = new UmbracoIndexer("Site2");
List<SearchResult> results2 = examine2.Search("find this in Site2", true);

//disables wild card searching
List<SearchResult> results3 = examine2.Search("find exact matches in Site2", false);

//searches site 2 but only in NewsArticle document types
List<SearchResult> results4 = examine2.Search("find news in Site2",
    "NewsArticle", true, null);

//searches site 2 but only for nodes that are children of the node with ID 4999
List<SearchResult> results5 = examine2.Search("find something in Site2", "", true, 4999);

//searches site 1, in all of its defined doc types to be searched but only in
//the properties: PageTitle and PageContent and will only return a maximum
//of 10 results.
List<SearchResult> results6 = examine.Search("find in Site1", "",
    true, null, new string[] {"PageTitle","PageContent"}, 10);

Categories: .Net | Umbraco

Comments

7/11/2009 11:27:48 PM #

Nice addition, thanks for this.

I needed to change the configSection to this:

<section name="UmbLuceneIndex" type="TheFarm.Umbraco.Lucene.Common.Configuration.IndexSets, TheFarm.Umbraco.Lucene.Common" />

John United States

7/12/2009 1:31:30 PM #

Thanks for the heads up. There will be some larger changes to this library in the next few months on CodePlex. The namespaces will be completely changed. We'll be putting heaps of functionality into it so keep checking back!

ShannonDeminick Australia

7/31/2009 7:41:51 PM #

Hi - this looks very good - would it be possible to set this to index media, like PDFs for example?

Thomas United Kingdom

8/2/2009 11:44:49 AM #

Hopefully soon I'll be updating the source to make indexers and searchers provider model based. Though the solution is currently very scalable by means of inheriting from the classes in Umbraco Examine, having a provider model will make it extraordinarily flexible. I'm thinking of having a provider for each index-able item (i.e. Content, Media, Files, etc.). Then people will be able to just develop their own providers to do whatever they want.

ShannonDeminick Australia

8/14/2009 1:24:30 AM #

Hi Shannon.
This looks so good - but as you know "more wants more". And as we were discussing at CodeGarden09 I'm looking for a search engine that will allow me to catch pdfs and docs which are uploaded via a document type in the content section - and not via the media section. Would you be able to help me out with some code for an indexer.cs - and/or would you be willing to include such a feature in the upcoming version of Umbraco Examine?
Thanks for any kind of response :-)

Søren Tidmand Denmark

8/14/2009 4:04:32 AM #

Lucene can index anything that can be converted to a string.

So you really just need pdf to text:

http://www.codeproject.com/KB/cs/PDFToText.aspx

AnthonyDang Australia

8/21/2009 3:22:52 AM #

It's looking like Examine will be put into the final 4.1 build. It will be redeveloped with a provider model so you can write or implement anyone else's provider to index anything.
We're going to implement Examine into 4.1 for internal indexing/searching but leave exposing public indexing/searching up to packages for people to create.
Essentially since it will be a provider model, you will be able to just implement a custom provider to index or search anything based on your requirements.
Examine will be shipped with its own provider, but it will be a basic content and media provider with no support for PDFs or Word docs. This can easily be developed, though.

ShannonDeminick Australia

9/25/2009 3:18:58 PM #

This is a really useful extension.

But I have a question. In web.config (Umbraco 4.0.2.1) I added instructions to create multiple indexes. However, only one was created (in the specified directory): the default index set. In all, I need to create 2 index sets. When I change DefaultIndexSet from the first to the second, the second index is created, but the first stops being updated. What am I doing wrong?

p.s. I also changed configSections to:
<section name="UmbLuceneIndex" type="TheFarm.Umbraco.Lucene.Common.Configuration.IndexSets, TheFarm.Umbraco.Lucene.Common" />

Sergiy Ukraine

9/28/2009 3:35:44 AM #

Hi, Umbraco Examine has been refactored to support a provider model. I haven't created a new release on CodePlex for it yet, but you can download the latest build from the source control tab. It is much more robust than the previous version, so I recommend that you use it instead. I will be posting the docs for the new version soon, but it should be relatively easy to figure out if you download the latest build.

ShannonDeminick Australia

10/12/2009 2:15:04 PM #

Shannon, thanks for your reply.

I have downloaded Examine from CodePlex and have studied the test project, but I have a few questions about it. So I'll wait for the official beta release of Umbraco 4.1 with Examine; I hope to get answers to these questions then.

Sergiy Ukraine

10/12/2009 2:19:18 PM #

Just let me know what questions you have, I'll be happy to answer them!
(I'll try to get a new release created for Examine on CodePlex this week)

ShannonDeminick Australia

10/29/2009 7:53:08 PM #

This seems to be a great extension for Umbraco. Can't get it to work with more than the default index set. Seems like they are not indexed at all.

Then I tried to get it to work with the new refactored version, as Shannon suggested to Sergiy, but I can't figure out how to search sets other than the default one.
It would be really nice, if we could get the above example documentation updated for the new version.

Dan Christoffersen Denmark

4/29/2010 5:41:23 PM #

Is it possible to fetch all documents? Like MatchAllDocsQuery?

Folkert Netherlands

4/30/2010 2:48:26 AM #

Folkert - Examine has evolved a lot since this post and has become a provider model indexer. Because of this it is agnostic of Lucene.Net and since Document is a Lucene.Net concept we don't expose it.

You can get all the results by enumerating through the query result. I recommend you check out the CodePlex project page for more documentation.

AaronPowell Australia

8/14/2010 4:40:07 AM #

Many thanks to the person who made this post, this was very informative for me. Please continue this awesome work. Sincerely...

Air force one United States
