Apache Solr Removing Data Import Handler

Urgent decision needed to replace Apache Solr Data Import Handler (DIH)

Apache Solr’s Data Import Handler (DIH) was deprecated in Solr 8.6 and will be removed completely from the Solr distribution in 9.0.

DIH provides a mechanism for importing content from a data store and indexing it.  Users who rely on it for production systems face an urgent decision: stay on a version of Solr that still includes it and fall behind on Solr features, or embrace the change and implement something else sooner rather than later.

Here is the official community notice:

JIRA TICKET : https://issues.apache.org/jira/browse/SOLR-14066

DIH is being replaced by a community contribution which could have certain limitations

For example, it states that it ships only with the MariaDB driver.  Theoretically it should be possible to manually configure a JDBC driver for other sources, but this is not a well-trodden path yet.  One other caution is that it has had no activity for nearly 6 months, and so has not been updated to track the latest version (Solr 8.11).

There is a clear risk that the project will not be maintained by the community.  I’m sure the Solr contributors making this decision hope there will be more activity once DIH is actually removed from Solr releases, but that is itself a risky strategy.

DIH is being removed primarily as part of a big security revamp in Solr, so it may not be replaced by anything comparable soon, or at all.  This represents a risk, and anyone using DIH in production is advised to look at alternative methods of populating the search index.

Options for replacing Solr Data Import Handler (DIH)

  1. The official DIH community replacement.  This is likely to be a snapshot of the Apache DIH maintained by a small number of individuals, but it is unclear how far it will provide like-for-like compatibility with the existing DIH, and there is doubt about the appetite to keep it maintained.
  2. Open source ETL tool.  There are several integration tools available, including Apache Camel and Apache NiFi, which could be used to replace DIH.  Camel is a set of Java integration libraries, whereas NiFi is more of a platform with a user interface.  They are often used in conjunction with a queue such as Kafka/JMS/ActiveMQ.
  3. Commercial ETL tool.  There are too many to list here, but most ETL tools can push to a REST endpoint and could perform the transformations currently performed by DIH.  Some vendors have specific Solr emitters.
  4. Custom publisher application, i.e. build it yourself.
  5. Switch to Elasticsearch or the new Amazon OpenSearch and use the Logstash JDBC input plugin.  This would require a substantial amount of re-work and so should not be undertaken lightly.
  6. Pureinsights Data Platform (PDP), our cloud native, ultra-scalable content processing platform developed to help organisations improve any search system to just “make it work like Google”.  PDP includes an RDBMS connector, Website Crawler, Filesystem Crawler and a Solr Hydrator, and adds a lot of Google-like features to Solr such as Featured Snippets, FAQs, Knowledge Graph Answers (via Neo4j) and facet snapping.  It also provides a deep set of metrics about current relevance effectiveness, called Engine Scoring, which is fully automated and scientifically objective.  Check out this link for more information.
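
To give a feel for what option 4 involves, here is a minimal sketch of a custom publisher in Python.  It assumes a database reachable through Python's DB-API (SQLite is used here purely so the example is self-contained) and a Solr collection that accepts JSON documents on its /update handler.  The Solr URL, the collection name "products", the batch size and the commitWithin value are all hypothetical placeholders, and a real publisher would also need error handling, retries, delta imports and field transformations.

```python
import json
import sqlite3
import urllib.request
from typing import Iterable, Iterator

# Hypothetical values for illustration only; adjust to your own deployment.
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "products"
COMMIT_WITHIN_MS = 10000  # ask Solr to commit within 10 seconds of each update


def rows_to_docs(conn: sqlite3.Connection, query: str) -> Iterator[dict]:
    """Run the query and yield one Solr document (a dict) per row."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


def batched(docs: Iterable[dict], size: int) -> Iterator[list]:
    """Group documents into lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_update_request(docs: list) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the collection's /update handler."""
    url = f"{SOLR_URL}/{COLLECTION}/update?commitWithin={COMMIT_WITHIN_MS}"
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def publish(conn: sqlite3.Connection, query: str, batch_size: int = 500) -> int:
    """Send every row returned by the query to Solr; return the number sent."""
    sent = 0
    for batch in batched(rows_to_docs(conn, query), batch_size):
        # urlopen raises urllib.error.HTTPError if Solr rejects the batch.
        with urllib.request.urlopen(build_update_request(batch)) as response:
            response.read()
        sent += len(batch)
    return sent
```

Calling publish(conn, "SELECT id, name FROM product") would then stream the table into Solr in batches.  Even this small sketch shows why the build-it-yourself route needs thought: scheduling, incremental updates and failure recovery, which DIH users may take for granted, all become your responsibility.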

What should I do next?

If you’re using DIH in production systems, now is the time to start considering options to mitigate its removal in coming versions of Solr.  It’s an unwanted distraction, but one that needs careful consideration nonetheless.

If you need help making this decision, check out our Solr consulting services or CONTACT US for a free 1:1 consultation.

Cheers,

Phil
