Using Examine to index & search with ANY data source

by Shannon Deminick 10. August 2010 10:38

During CodeGarden 2010 a few people asked how to use Examine to index and search data from any data source, such as custom database tables. Previously, the only way to do this was to override the Umbraco Examine indexing provider, remove the Umbraco functionality embedded in there, and then do a lot of coding yourself. But now there’s some great news! As of now you can use all of the Examine goodness, with its embedded Lucene.Net, with any data source, and you can do it VERY easily.

Some things you need to know about the new version:

  1. I haven’t made a release version of this yet as it still needs some more testing, though we are putting this into a production site next week.
  2. If you want to try this, currently you’ll need to get the latest source from Examine @ CodePlex
  3. If you are using a previous version of Examine, there are a few breaking changes because some of the class structures have been moved. Your config file should still work as is, HOWEVER you should update it to reflect the new one with the new class names
  4. There are now 3 DLLs, not just 2:
    • Examine.DLL
      • Still pretty much the same… contains the abstraction layer
    • Examine.LuceneEngine.DLL
      • The new DLL to use to work with data that is not Umbraco specific
    • UmbracoExamine.DLL
      • The DLL that the Umbraco providers are in

Ok, now on to the good stuff. First, I’ve added a demo project to this post which you can download HERE. This project is a simple console app that contains a sample XML data file that has 5 records in it. Here’s what the app does:

  1. Re-indexes all data
  2. Searches the index for node id 1
  3. Ensures one record is found in the index
  4. Updates the dateUpdated time stamp for the data record
  5. Re-indexes the record with node id 1

So assuming that you have some custom data like a custom database table, xml file, or whatever, there’s really only 3 things that you need to do to get Examine indexing your custom data:

  1. Create your own ISimpleDataService
    • There is only 1 method to implement: IEnumerable<SimpleDataSet> GetAllData(string indexType)
    • This is the method that Examine will call to re-index your data
    • A SimpleDataSet is a simple object containing a Dictionary<string, string> and an IndexedNode object (which consists of a Node Id and a Node Type)
    • For example, if you had a database row, your SimpleDataSet object for the row would be the dictionary of the row’s values plus its node id and type … easy.
  2. Use the ToExamineXml() extension method to re-index individual nodes/records
    • Examine relies on data being in the same XML structure as Umbraco (which we might change in version 2 sometime in the future… like next year) so we need to transform simple data into the XML structure. We’ve made this quite easy for you; all you have to do is get the data from your custom data source into a Dictionary<string, string> object and use this extension method to pass the xml structure in to Examine’s ReIndexNode method.
    • For example: ExamineManager.Instance.ReIndexNode(dataSet.ToExamineXml(dataSet["Id"], "CustomData"), "CustomData");  where dataSet is a Dictionary<string, string> .
  3. Update your Examine config to use the new SimpleDataIndexer index provider and the new LuceneSearcher search provider
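
As a rough illustration of step 1, here is a minimal sketch of a data service. It is not the code from the demo project: the CustomRecord class and the LoadRecords() helper are hypothetical stand-ins for your own data source, and the namespaces are assumed from the class names used in the config. Only ISimpleDataService, GetAllData, SimpleDataSet, IndexedNode, NodeDefinition and RowData come from Examine itself, and their exact shape may differ between versions.

```csharp
using System;
using System.Collections.Generic;
using Examine;
using Examine.LuceneEngine;

namespace ExamineDemo
{
    // Hypothetical record type standing in for a row from your own data source.
    public class CustomRecord
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public string Description { get; set; }
        public DateTime DateUpdated { get; set; }
    }

    public class CustomDataService : ISimpleDataService
    {
        // Examine calls this method when it re-indexes all data for an index type.
        public IEnumerable<SimpleDataSet> GetAllData(string indexType)
        {
            var data = new List<SimpleDataSet>();

            foreach (var record in LoadRecords())
            {
                data.Add(new SimpleDataSet
                {
                    // The node definition: a unique id plus the index type from the config.
                    NodeDefinition = new IndexedNode
                    {
                        NodeId = record.Id,
                        Type = indexType
                    },
                    // The field values to index, keyed by the field names listed in the index set.
                    RowData = new Dictionary<string, string>
                    {
                        { "name", record.Name },
                        { "description", record.Description },
                        { "dateUpdated", record.DateUpdated.ToString("s") }
                    }
                });
            }

            return data;
        }

        // Hypothetical loader: replace with a query against your database, XML file, etc.
        private static IEnumerable<CustomRecord> LoadRecords()
        {
            return new[]
            {
                new CustomRecord
                {
                    Id = 1,
                    Name = "First record",
                    Description = "An example record",
                    DateUpdated = DateTime.Now
                }
            };
        }
    }
}
```

The dictionary keys match the field names declared under IndexUserFields in the config, and the node Type is simply the indexType that Examine passes in, so the same service could serve more than one index type.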

If you’re not using Umbraco at all, then you’ll only need to have the 2 Examine DLLs which don’t reference the Umbraco DLLs whatsoever so everything is decoupled.

I’d recommend downloading the demo app and running it, as it will show you everything you need to know about getting Examine running with custom data. However, I know that people just like to see code in blog posts, so here’s the config for the demo app:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="Examine" type="Examine.Config.ExamineSettings, Examine"/>
    <section name="ExamineLuceneIndexSets" type="Examine.LuceneEngine.Config.IndexSets, Examine.LuceneEngine"/>
  </configSections>
  <Examine>
    <ExamineIndexProviders>
      <providers>
        <!--
          Define the indexer for our custom data. Since we're only indexing one type of data,
          there's only 1 indexType specified: 'CustomData', however if you have more than one
          type of index (i.e. Media, Content) then you just need to list them as a comma
          separated list without spaces.
          The dataService is how Examine queries whatever data source you have, in this case
          it's a custom data service defined in this project. A custom data service only has
          to implement one method... very easy.
        -->
        <add name="CustomIndexer"
             type="Examine.LuceneEngine.Providers.SimpleDataIndexer, Examine.LuceneEngine"
             dataService="ExamineDemo.CustomDataService, ExamineDemo"
             indexTypes="CustomData"
             runAsync="false"/>
      </providers>
    </ExamineIndexProviders>
    <ExamineSearchProviders defaultProvider="CustomSearcher">
      <providers>
        <!-- A search provider that can query a lucene index, no other work is required here -->
        <add name="CustomSearcher"
             type="Examine.LuceneEngine.Providers.LuceneSearcher, Examine.LuceneEngine" />
      </providers>
    </ExamineSearchProviders>
  </Examine>
  <ExamineLuceneIndexSets>
    <!-- Create an index set to hold the data for our index -->
    <IndexSet SetName="CustomIndexSet" IndexPath="App_Data\CustomIndexSet">
      <IndexUserFields>
        <add Name="name" />
        <add Name="description" />
        <add Name="dateUpdated" />
      </IndexUserFields>
    </IndexSet>
  </ExamineLuceneIndexSets>
</configuration>

Categories: .Net | Examine | Umbraco

Comments

8/10/2010 6:59:46 PM #

Could this be used to index and search a table of say members or customers and search across multiple columns?  For instance I want to search for "James" which might be a first name or a last name or a company name or...

Connie DeCinko United States

8/11/2010 2:34:38 AM #

Yes, and Examine already has that built into it, it's not a specific feature for this version

AaronPowell Australia

8/12/2010 12:27:14 AM #

Great example, it'll really help integrate it with other data sources. Thanks

Chris United Kingdom

8/12/2010 12:15:08 PM #

How to build a search query in Examine

FARMCode.org

10/11/2010 10:00:38 AM #

Is there a way to search multiple indexes with the same query? In Umbraco I have one index for the cms-managed content and one for a bunch of PDF files and I'd like the same query to look at both.

Seems like this would be another option to ExamineManager.Instance.SearchProviderCollection

Thanks for a great product!

Andrew Waegel United States

12/29/2010 6:16:47 PM #

Hello Shannon

I'm trying to add the SimpleDataIndexer to my already working Umbraco installation, but after setting up the .config files I get the following exception when the application is started:

System.Reflection.TargetInvocationException: An exception occurred in the target of the invocation. ---> System.TypeInitializationException: An exception occurred in the type initializer  'search. ---> System.TypeInitializationException: An exception occurred in the type initializer 'Examine.ExamineManager'. ---> System.Configuration.ConfigurationErrorsException: Value cannot be null.
Parameter name: type (\config\ExamineSettings.config line 13)
at System.Web.Configuration.ProvidersHelper.InstantiateProvider(ProviderSettings providerSettings, Type providerType)
at System.Web.Configuration.ProvidersHelper.InstantiateProviders(ProviderSettingsCollection configProviders, ProviderCollection providers, Type providerType)
at Examine.ExamineManager.LoadProviders()
at Examine.ExamineManager..ctor()
at Examine.ExamineManager..cctor()
--- End of inner exception stack trace ---
at Examine.ExamineManager.get_Instance()
at search.RN..cctor() at \App_Code\RN.cs:line 20
--- End of inner exception stack trace ---
--- End of inner exception stack trace ---  
at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandle& ctor, Boolean& bNeedSecurityCheck)
at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean fillCache)
at System.RuntimeType.CreateInstanceImpl(Boolean publicOnly, Boolean skipVisibilityChecks, Boolean fillCache)
at System.Activator.CreateInstance(Type type, Boolean nonPublic)
at System.Activator.CreateInstance(Type type)
at umbraco.macro.GetXsltExtensions()


My config/ExamineSettings.config file looks like:

<providers>
      <!-- The next one is line 13, which throws the exception -->
      <add name="CustomIndexer" type="Examine.LuceneEngine.Providers.SimpleDataIndexer, Examine.LuceneEngine"
        dataService="ExamineDemo.CustomDataService, ExamineDemo"
        indexTypes="CustomData"
        enableDefaultEventHandler="false"
        runAsync="false"/>
</providers>

"CustomData" is defined inside the config/ExamineIndex.config file. My Umbraco version is 4.5.4, Windows 7, IIS 7.5.

Any idea on what could be going on?
Regards,
Jorge

Jorge Spain

1/4/2011 7:05:25 PM #

Hi Shannon, I think there is a minor mistake in your demo project: in file CustomDataService.cs on line 79, you're updating the "id" attribute of the node, while I think you would want to update the "dateUpdated" attribute. Doesn't impact the demo in any way, but still incorrect.

BTW, I can't thank you enough for creating something (Examine) I've been wanting for two years, but haven't been able to create due to sheer incompetence :) Now I can!

Thijs Kuipers Netherlands

1/25/2011 2:07:59 PM #

Hi.
I tried the example and it works fine. But I want to call the dataService from another project. Is this possible?

Henrik Madsen Denmark

2/21/2011 10:50:06 AM #

Hi,

In the example, I can reindex node, but do you know
how to delete the node ?

Thanks for your help.

Kurniawan

Kurniawan Australia

2/21/2011 10:52:18 AM #

you can just use the API to delete an entry from the index

ShannonDeminick United States

2/21/2011 4:36:47 PM #

there is deleteNodeFromIndex(nodeId)

but the problem is I can't delete by node id and type like you can when you reindex,
because my custom data will index 10 different tables,
so the same id will appear from different tables.

Do you think I have to separate them into 10 different indexes to
be able to delete by node id?

thanks

kurniawan United States

2/22/2011 2:50:02 AM #

Your ID should be unique to the whole index. If you are adding data from 10 different tables, then you need to ensure that the ID is unique. In your data provider, you could just return TableName_ID as the ID

ShannonDeminick United States

2/22/2011 3:51:42 AM #

I want to set the id to TableName_ID

But in your example, SimpleDataSet
uses an (integer) id.

//add a new SimpleDataSet object to the list
                data.Add(new SimpleDataSet()
                {
                    //create the node definition, ensure that it is the same type as referenced in the config
                    NodeDefinition = new IndexedNode()
                    {
                        NodeId = item.ProductID, //INTEGER
                        Type = TreeAlias.product.ToString()
                    },
                

Do you think I should recreate the simple data set, or do I just need to separate them into different indexes?

Thanks for your help.

Kurniawan Australia

3/24/2011 4:03:01 PM #

I have problems applying this example to a working installation of Umbraco 4.7. I am trying to apply it to real SQL data.
However I cannot find what exactly the GetAllData function should do (it is not implemented in the example?).
There are also problems with the object namespaces using the most recent examine releases (1.0 RTM).
I get cryptic errors in the umbraco log ...
Is there an updated example showing more details?

Carsten Denmark

3/24/2011 4:16:49 PM #

GetAllData will return all the data that you want to be indexed.

For the namespaces, make sure you use the ones from the same version as yours in svn

kurniawan Australia

3/25/2011 2:08:35 AM #

It would be helpful if you could let us know what these "cryptic errors" in the Umbraco log are...

ShannonDeminick Australia

3/25/2011 12:38:44 PM #

@Shannon

The cryptic messages were actually related to a combination of wrong namespaces and errors in my own ASP.NET code ;)
I have gotten it to work :-D - but have additional problems/questions:
I am indexing a SQL news database with currently 55.000 records, growing daily by about 10 records. Does Examine need the entire dataset (Examine.LuceneEngine.SimpleDataSet) in memory?
And can I "reload" the already indexed index files after an Umbraco restart instead of going back to the SQL database for ALL the data and reindexing it from the start again?

Carsten Denmark

3/25/2011 1:00:35 PM #

You just need to reload all index just one for 55.000.
If there additional growing data daily, you can just add new index for the new data only.

If there is update, then you can just reindex the updated data only.


For reindex specific data or add new specific data, here is the example.
ExamineManager.Instance.IndexProviderCollection[indexProvider].ReIndexNode(dataSet.RowData.ToExamineXml(dataSet.NodeDefinition.NodeId, ExamineSearchDataType.All), ExamineSearchDataType.All);


if there is delete, then you just delete the index for particular data.
ExamineManager.Instance.IndexProviderCollection[indexProvider].DeleteFromIndex(nodeID.ToString());

You don't need to reindex all data everytime you restart your application, because the index is stored on your local machine (not in memory).
which by default is \App_Data\TEMP\ExamineIndexes

Kurniawan Australia

3/25/2011 1:02:37 PM #

Sorry for mistype
I mean "You just need to index (55.000 records) once.

Kurniawan Australia

3/25/2011 1:18:11 PM #

@Kurniawan
Thanks :-)
But will I have to load the entire SQL table into the dataSet and keep track of which rows have been indexed and which haven't in my own code?
How do I access information about which nodes have been indexed?
Are the already stored indexes autoloaded at restart, and can I count the number of already indexed rows?
I have tried to dive into the documentation but can't really find much...

Carsten Denmark

3/25/2011 1:30:53 PM #


You don't need to load entire SQL table into dataset.

In the beginning, you can just run the reindex of all data (55000),
and after that you can use an event handler to monitor your index.

For example, after adding new data, re-saving data, or deleting data, your event handler should trigger the reindexing part.


The index is always there as long as \App_Data\TEMP\ExamineIndexes is there since your first reindex.
and all the search result will read this index to give you the search result.

Please have a look at the Umbraco source code to see how they monitor the index.
Check the ApplicationBase event handler: when a node is published, it triggers a reindex of the individual node, or even crawls the child nodes.

Kurniawan Australia

3/25/2011 1:55:54 PM #

Browsed the codeplex umbraco - huge source - could you be more specific as to where the Application Base Event Handler can be found?

My road forward (taking into account the specifics of my source data) seems to be to count the number of currently indexed nodes and check whether new/updated source data rows need to be added.

Carsten Denmark

3/25/2011 2:27:47 PM #

you can just search all class which inherits ApplicationBase.
type this on search box at visualstudio
 : ApplicationBase through all solutions.

you don't need to check the count. just monitor it through event handler 
once you delete update or add, reindex that data

kurniawan United States

5/11/2011 7:53:18 PM #

I'm lost as to how to integrate this with an existing Umbraco web site.  Do I add my provider information to my existing Examine___.config files, or do I keep everything separate since it's being used as part of a custom usercontrol?  Do you have an example using a web form?  Can I add this code to my existing .NET project containing all my site's custom usercontrols?

Connie DeCinko United States

5/15/2011 6:25:44 AM #

Yes, of course you have to add it to the existing config file... it's a standard .Net configuration section, and you can only have one.

Regardless of whether it's a user control, any C# code or whatever, it's the same. Just create a custom provider, add it to your config, then re-index your index with the ExamineManager. Did you download the demo?

Shannon Australia

5/16/2011 7:08:25 PM #

I added to the config file and it appears that I am getting the collection created.  Now, I just can't seem to figure out how to populate the collection.  The demo was helpful to a certain extent, just not sure how to do it with SQL instead of XML.

Here is what I left off with for my dataservice.cs:

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Text;
using Examine;
using Examine.LuceneEngine;

namespace SBA.AZBar.UserControls
{
  /// <summary>
  /// The data service used by the LuceneEngine in order for it to reindex all data
  /// </summary>
  public class MemberFinderDataService : Examine.LuceneEngine.ISimpleDataService
  {
    /// <summary>
    /// loads the data source into memory
    /// </summary>
    static MemberFinderDataService()
    {
      SqlCommand cmd = new SqlCommand();

      StringBuilder sql = new StringBuilder();

      sql.Append("SELECT * FROM Populate_Search");

      cmd.CommandText = sql.ToString();

      SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["AZBar_DOTW_Membership"].ToString());
      cmd.Connection = con;
      con.Open();

      DataSet ds = new DataSet();
      SqlDataAdapter sqlAdap = new SqlDataAdapter(cmd);
      sqlAdap.Fill(ds);

      // assign the static field (declaring a local 'DataTable dt' here would shadow it and leave the field null)
      dt = ds.Tables[0];

      con.Close();

    }

    private static DataTable dt;

    /// <summary>
    /// Returns a list of type SimpleDataSet based on the Populate_Search view query
    /// </summary>
    /// <param name="indexType"></param>
    /// <returns></returns>
    public IEnumerable<SimpleDataSet> GetAllData(string indexType)
    {
      var data = new List<SimpleDataSet>();

      //open the datatable and iterate the rows
      foreach (DataRow row in dt.Rows)
      {
        // add a new SimpleDataSet object to the list

        data.Add(new SimpleDataSet()
        {
          // create the node definition, ensure that it is the same type as referenced in the config
          NodeDefinition = new IndexedNode()
          {
            NodeId = (int)row["id"],
            Type = "MemberFinderData"
          },
          // add the data to the row
          RowData = new Dictionary<string, string>()
          {
            {"id", row["id"].ToString()}, // id is read as an int above, so convert rather than cast to string
            {"name_first", (string)row["name_first"]},
            {"name_middle", (string)row["name_middle"]},
            {"name_last", (string)row["name_last"]}
          }
        });
      }

      return data;
    }


  }
}

Connie DeCinko United States

6/28/2011 8:34:49 PM #

The sample shows searching for node id 1.  How do we perform a real search?  How do I specify search words and pass them as search criteria and get results back?

Connie DeCinko United States

6/30/2011 1:44:05 AM #

When can we expect the rest of the documentation?  In looking at the hints in VStudio there appear to be many undocumented options.  Looking at the Lucene docs does not really help as the syntax is not the same.

I've finally been able to create the collection but now am unable to search for phrases within my collection.

Connie DeCinko United States

7/25/2011 10:46:59 PM #

How do I save the results to a DataTable?  I need to use it as a data source for my sortable GridView.

Connie DeCinko United States