Solr

Attention: The Solr index might experience delays and should not be used as source of truth for critical operations.

Introduction and activation

Whenever an entry is created, modified or removed in EntryStore, events are created which are caught by listeners that update the Solr index accordingly. If needed, the Solr index can be recreated during the startup of EntryStore. The Solr-related parameters in the EntryStore configuration are:

entrystore.solr=on
entrystore.solr.reindex-on-startup=off
entrystore.solr.reindex-on-startup.wait=off # if on: blocks the startup until reindexing is complete
entrystore.solr.url=http://localhost:8080/solr/entrystore-core1
entrystore.solr.max-limit=200 (default: 100)
entrystore.solr.auth.username=<username when connecting via HTTP, leave empty if no authentication is necessary>
entrystore.solr.auth.password=<password when connecting via HTTP>

If the URL is a local path, an embedded Solr server (SolrJ) is used (note that the Lucene team recommends Solr via HTTP):

entrystore.solr.url=/srv/entrystore/solr/

Using custom Solr configurations

EntryStore comes with a default Solr configuration consisting of solrconfig.xml and schema.xml (both file names contain a trailing _default in the source repository) which the Solr Core is initialized with.

It is possible to override the locations of these configuration files, e.g. to provide a custom schema configuration. The files are loaded upon startup of EntryStore and copied to the local folder of the Solr Core. Keep in mind that a custom schema must also provide field definitions that match the indexed information (see further down).

entrystore.solr.schema.url=https://your-server.tld/schema.xml
entrystore.solr.config.url=https://your-server.tld/solrconfig.xml

The values must be URLs with the http, https, or file scheme. Redirects are not followed.

In general it is recommended to stick to EntryStore's default Solr configuration.

Configuration of Solr HTTP server

If a separate Solr server is set up for EntryStore, the supplied schema.xml file has to be used. For the remaining configuration (e.g. solrconfig.xml), the example configuration in example/solr/collection1/conf in the Solr distribution-tar.gz can be reused without modification.

Indexed information

The following pieces of information of an entry are indexed in Solr (see also schema.xml) and can therefore be used as search parameters (they are case-sensitive!):

  • title: All titles (DC and DC terms) in all languages, including FOAF and VCard names.
  • title.lang: To facilitate sorting by title, a dynamic field is created with the pattern title.[two-letter language code according to ISO 639-1]. Multiple values are not supported because these fields are intended to be used for sorting. Example: title.en for a title in English, title.nolang for a title without language set.
  • description: All descriptions in all languages.
  • tag.literal: All literal tags in all languages. Covers common tag predicates such as dc:subject and dcterms:subject.
  • tag.literal.lang: If a tag has a language set, a dynamic field is created with the pattern tag.literal.lang, e.g., tag.literal.en for all literal tags in English.
  • tag.uri: All resource tags. Covers common predicates such as dc:subject and dcterms:subject.
  • metadata.predicate.literal.<md5>, metadata.predicate.uri.<md5>: Used to carry out queries for exact predicate-object combinations. Due to restrictions in Solr it was necessary to shorten the predicate URI, which was done using an MD5 hash (hex) of the predicate URI truncated after 8 characters. Collisions are unlikely and, should they occur, non-fatal. The field value is the string value of the object. The field names are different for URI and literal objects due to different indexing strategies in Solr for the respective object type (literals are indexed using "text_ngram" and URIs are indexed using "string").
  • metadata.predicate.literal_s.<md5>: same as metadata.predicate.literal.<md5> (without _s) above, but indexed as "string" instead of "text_ngram".
  • metadata.predicate.literal_t.<md5>: same as metadata.predicate.literal.<md5> (without _t) above, but indexed as "text" instead of "text_ngram".
  • metadata.predicate.date.<md5>: same as metadata.predicate.literal.<md5> above, but indexed as "date" instead of "text_ngram", enabling date-specific queries such as range queries.
  • metadata.predicate.integer.<md5>: same as metadata.predicate.literal.<md5> above, but special treatment for integer literals. This field is single-valued to allow for sorting and covers all numerical integer datatypes (including e.g. short, byte, long, etc). Integers are indexed using Solr's "slong" type.
  • related.metadata.predicate.*.<md5>: same as metadata.predicate.*.<md5> above, but based on the related property index as described further down on this page.
  • metadata.object.literal: All values of string literals (the datatype must not be specified), independent of language.
  • metadata.object.uri: All object URIs.
  • lang: The language of the resource, fetched from dc and dcterms:language. Used for searches.
  • all: catch-all Solr field, containing title, description, tags from above. This field is used if no search property is provided in the Solr query.
  • uri: Entry URI.
  • resource: Resource URI.
  • rdfType: The RDF type of the resource, fetched from the entry and the metadata graph.
  • context: Resource URI of the entry's surrounding context.
  • contextname: Name of the context if the indexed entry is a context or system context. This field is not used for regular entries.
  • creator: Creator URI.
  • contributors: URIs of contributors.
  • lists: Resource URIs of referring lists.
  • public: true if the entry metadata is readable by the guest user. Warning: if a context's ACL is updated, the context's entries are not re-indexed. As a consequence, the public field may be outdated and incorrect if the context is not re-indexed manually after ACL changes. Do not use this field if you do not fully understand how it works.
  • created: Creation date.
  • modified: Modification date.
  • graphType: the builtin type, case sensitive (i.e. Context, SystemContext, User, Group, List, ResultList, String, Graph, Pipeline, None).
  • entryType: the location type, case sensitive (i.e. Local, Link, LinkReference, Reference).
  • resourceType: the resource type, case sensitive (i.e. InformationResource, ResolvableInformationResource, NamedResource, Unknown).
  • acl.admin: URIs of principals with admin rights (explicitly set in entry info).
  • acl.metadata.r: URIs of principals with read rights on metadata (explicitly set in entry info).
  • acl.metadata.rw: URIs of principals with read/write rights on metadata (explicitly set in entry info).
  • acl.resource.r: URIs of principals with read rights on the resource (explicitly set in entry info).
  • acl.resource.rw: URIs of principals with read/write rights on the resource (explicitly set in entry info).
  • profile: URI of the entry's application profile. Originates from the entry graph and consists of the object value of the predicate http://entrystore.org/terms/profile (the deprecated predicate http://entryscape.com/terms/entityType is supported for backwards compatibility).
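
The construction of the metadata.predicate.* field names described above can be sketched in Python. This is a minimal illustration, not EntryStore's own code: the helper name predicate_field is made up for this example, and hashing the URI as UTF-8 bytes is an assumption.

```python
import hashlib

def predicate_field(predicate_uri: str, kind: str = "literal") -> str:
    """Build a field name such as metadata.predicate.literal.<md5>.

    Assumption: the predicate URI is hashed as UTF-8 bytes. The suffix is
    the hex MD5 digest cut off after 8 characters, as described above.
    """
    digest = hashlib.md5(predicate_uri.encode("utf-8")).hexdigest()
    return f"metadata.predicate.{kind}.{digest[:8]}"

# Field name for literal values of dcterms:subject
print(predicate_field("http://purl.org/dc/terms/subject"))
```

Passing "uri", "date" or "integer" as kind yields the corresponding field names for the other object types.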

The following fields support incremental/partial matching through NGram (min=3, max=25) on the index and Edge NGram (min=3, max=25) on the query:

  • title
  • title.lang
  • description
  • tag.literal
  • tag.literal.lang
  • metadata.predicate.literal.*
  • related.metadata.predicate.literal.*
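
To illustrate why partial matching works on these fields, the following Python sketch models the two gram generators in a simplified way. It does not reproduce Solr's analyzer chain (tokenization and lowercasing are left out), the function names are made up for this example, and treating a single shared gram as a match is a simplification.

```python
def ngrams(term: str, min_len: int = 3, max_len: int = 25) -> set[str]:
    """All substrings of term with a length between min_len and max_len."""
    grams = set()
    for size in range(min_len, min(max_len, len(term)) + 1):
        for start in range(len(term) - size + 1):
            grams.add(term[start:start + size])
    return grams

def edge_ngrams(term: str, min_len: int = 3, max_len: int = 25) -> list[str]:
    """Prefixes of term with a length between min_len and max_len."""
    return [term[:size] for size in range(min_len, min(max_len, len(term)) + 1)]

indexed = ngrams("organic")    # grams stored for an indexed value
query = edge_ngrams("gani")    # grams derived from a query term
# Simplified matching: at least one query gram is among the indexed grams.
print(any(gram in indexed for gram in query))
```

In this model a term shorter than three characters produces no grams at all, which is why very short query terms cannot match partially.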

CAVEAT: all colons in URI parameters must be escaped with a backslash, i.e. "\:", otherwise Solr throws an exception.

RDF Format

By default, the Solr search returns all RDF graphs in RDF/JSON. This can be overridden by providing the URL parameter rdfFormat with the alternative value application/ld+json to serialize the RDF graphs using JSON-LD. Optimization takes place to detect the namespaces in use, which are used as JSON-LD contexts.

Solr Query Syntax

The default query field concatenator in EntryStore is OR. This can be overridden in every query, see examples below.

See the following links for more detailed information about the query parser syntax:

EntryStore uses the DisMax request handler.

The following characters need to be escaped (using \) inside the query parameter because they are part of the Solr query syntax: \+-!():^[]"{}~*?|&;/.
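
A small helper can take care of the escaping. The Python sketch below is an illustration only: the function name escape_solr is hypothetical, and it assumes that the period at the end of the character list above closes the sentence and is not itself a special character.

```python
# Characters listed above as part of the Solr query syntax.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')

def escape_solr(value: str) -> str:
    """Prefix every special character in value with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in value)

# Escape a URI before using it as a value in a query
print(escape_solr("http://purl.org/dc/terms/subject"))
```

The helper is meant for individual values such as URIs. Applying it to a complete query would also escape the colon between field name and value and any operators.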

Sorting

The parameter sort can be used for Solr-style sorting, e.g. sort=title.en+asc,modified+desc. By default, results are sorted by score (relevance) and modified (last modification date). All text_sort and single-valued fields can be used for sorting. This basically excludes title, description and keywords, but allows sorting by e.g. title.en. title.* is of type text_ngram, but when used as a sorting property it is internally rewritten into title_sort.*, which is of type text_sort and ignores case.

Default language

This setting is mostly relevant for instances that are used with a language that is not natively supported by the client application (e.g. the client supports lang1 and lang2, but the user mostly provides metadata in lang3). In such cases it is possible to configure a default sorting language, which results in literals that match the default language being indexed as title.default and title_sort.default, respectively.

Configuration:

entrystore.solr.default-sorting-lang=<language code> (default: no default language set)

Boosting

In order to increase the relevance (boost) of certain fields it is possible to use the caret symbol ^, which is regular Solr syntax. To boost the title field with factor 10 it is possible to use the following construct: title:someterm^10. Boosting is specific for terms and not for fields, if provided at query time. Boosting at index time is possible, but requires modifications of the Solr configuration.

By default the fields used for sorting are score and modified. To get a search result based fully on relevance, the default sort setting must be overridden by providing score as the only sorting parameter: sort=score+desc.

Example

  • http://base/search?type=solr&query=title:organic^50+OR+title:farming^25+tag.literal:organic^10&sort=score+desc

The ^ needs to be URL encoded like the rest of the query. The example above does not encode special characters and URIs for better readability.
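
The encoding can be left to a standard library. The following Python sketch builds an encoded search URL with urllib.parse.urlencode. The host name base is the placeholder used throughout this page, and the query is a shortened variant of the example above.

```python
from urllib.parse import urlencode

params = {
    "type": "solr",
    "query": "title:organic^50 OR title:farming^25",
    "sort": "score desc",
}
# urlencode percent-encodes ":" and "^" and turns spaces into "+".
url = "http://base/search?" + urlencode(params)
print(url)
# http://base/search?type=solr&query=title%3Aorganic%5E50+OR+title%3Afarming%5E25&sort=score+desc
```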

Accuracy of result count

Every search result also contains an estimated result count (e.g. to be used with pagination). This number may not reflect the actual number of results that the requesting user has access to. In such cases the result count is always higher than the number of accessible entries.

The reason for this lies in how ACL management is done. The ACL is contained in the triple store, but for performance reasons it cannot be included in the Solr-index. The search is run against the Solr-index which also returns the result count (this is the same number as returned by the EntryStore search resource in the REST API), but the ACL of the matching entries is not looked up before the entries and their metadata are returned in the partial result set (because of pagination).

Pagination

The response size is limited to 50 by default and can be set to any value between 1 and 100 with the url parameter limit. If the result count is greater than the limit it is possible to fetch the next result batch using the offset parameter.

The following URL parameters are relevant for pagination:

  • limit: default 50, can be set to any integer from 1 to 100. The default max limit is 100 which can be overridden using the setting entrystore.solr.max-limit, see above.
  • offset: default 0, determines where the returned result set starts.
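
Fetching a complete result set then amounts to advancing offset in steps of limit. The Python sketch below only generates the request URLs. The function name page_urls is hypothetical, base is the placeholder host used on this page, and result_count stands for the estimated count taken from the first response.

```python
from urllib.parse import urlencode

def page_urls(base: str, query: str, result_count: int, limit: int = 50):
    """Yield one search URL per result batch, advancing offset by limit."""
    for offset in range(0, result_count, limit):
        params = {"type": "solr", "query": query, "limit": limit, "offset": offset}
        yield base + "?" + urlencode(params)

# 120 estimated results with limit 50 lead to offsets 0, 50 and 100
for url in page_urls("http://base/search", "organic", result_count=120, limit=50):
    print(url)
```

Because the result count is an estimate that can be higher than the number of accessible entries, a client should be prepared for batches that contain fewer entries than limit.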

Examples

  • http://base/search?type=solr&query=title:organic+AND+graphType:List&sort=score+desc,modified+desc
  • http://base/search?type=solr&query=organic+AND+lang:de&sort=title.de+desc,modified+desc
  • http://base/search?type=solr&query=organic+OR+ecology+lang:en (this would yield the same result also without OR)
  • http://base/search?type=solr&query=organic+lang:en&limit=10&offset=20
  • http://base/search?type=solr&query=title:*&limit=10&rdfFormat=application/ld+json

Syndication feed

Returning data in the format of an Atom or RSS feed can be enabled by using the parameter syndication, defining the format you want to use. The following values are supported:

  • atom_1.0
  • atom_0.3
  • rss_2.0
  • rss_1.0

You can also use the lang parameter to define preferred language of entries to return. The value should be a 2 character code following the ISO 639-1 specification ("se", "de", etc.). If omitted, or if an entry in the specified language does not exist, the language will be undetermined.

The following parameters are not supported while using syndication:

  • sort - The stream is always sorted in descending order according to when the entries were posted
  • offset - Pagination is not supported when using syndication

Parameters

  • facetFields: A comma-separated list of fields that should be treated as facets
  • filterQuery: A query to filter the results of a faceted search, e.g. facetX=valueY
  • facetMinCount: Result threshold for facets to be included in query result. Must be an Integer value, default is 1.
  • facetLimit: Maximum number of facets to be returned. Must be an Integer value, default is 100. Values up to 1000 are allowed. Higher limits can be allowed by setting entrystore.solr.facet-max-limit in entrystore.properties.
  • facetMatches: To only return facets that match a regular expression.
  • facetMissing: If set to true, a count of all results that match the query, but which have no facet value for the field should be computed and returned in the response. Default is false.

Fields

Fields that are indexed fully and without modification (i.e., field type "string" in Solr) may be used as facets; in particular the following fields are intended to be used as facets:

  • metadata.predicate.date.<md5>
  • metadata.predicate.integer.<md5>
  • metadata.predicate.literal.<md5> (this field is of type "text" which EntryStore automatically translates into metadata.predicate.literal_s.<md5> for faceted queries)
  • metadata.predicate.literal_s.<md5>
  • metadata.predicate.uri.<md5>
  • All fields containing a URI
  • All fields named with pattern *Type

Examples

  • http://base/search?type=solr&query=organic&facetFields=metadata.predicate.literal.256bd150&filterQuery=metadata.predicate.literal.256bd150:crop: Use dc:subject as facet and value "crop" as filter.

The related property index contains properties from other entries that the entry's metadata links to. All relations that are to follow must be configured using the following settings:

entrystore.solr.related=on|off (default: off)
entrystore.solr.related.properties.n=<predicate URI>

By default only entries in the same context are indexed. This can be changed per property by appending the "global" parameter to the property configuration. The URI and the parameter need to be separated by a comma:

entrystore.solr.related.properties.n=<predicate URI>,global

Example

entrystore.solr.related=on
entrystore.solr.related.properties.1=http://purl.org/dc/terms/creator,global
entrystore.solr.related.properties.2=http://schema.org/image
entrystore.solr.related.properties.3=http://schema.org/location
entrystore.solr.related.properties.4=http://schema.org/address

Inspecting an entry's index

The owner of an entry can inspect the indexed information of an entry by accessing its index URL, which is constructed by appending /index to the end of the entry URI.

Example: https://base/1/entry/1/index

The response body contains a JSON object with all indexed fields and their values.

Rebuilding the index

The index can be rebuilt upon startup (recommended) through a property in the EntryStore settings (see above). Alternatively, admin users (i.e. admin and members of the admin group) and context owners have the possibility to rebuild the Solr index or specific contexts live without restart through the management API using the following request:

POST http://base/management/solr

Content-Type: application/json

{
  "command": "reindex",
  "context": "https://base/_contexts/entry/context-id"
}

If the context parameter is omitted, the whole repository will be reindexed.
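
As an illustration, the request above can be prepared in Python with the standard library. This is a sketch under assumptions: the function name reindex_request is hypothetical, base is the placeholder host used on this page, and authentication (required for admin users and context owners) is not shown.

```python
import json
from urllib.request import Request

def reindex_request(base, context_uri=None):
    """Prepare (but do not send) the reindex request described above."""
    body = {"command": "reindex"}
    if context_uri is not None:
        body["context"] = context_uri  # omit to reindex the whole repository
    return Request(
        base + "/management/solr",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = reindex_request("http://base", "https://base/_contexts/entry/context-id")
print(req.get_method(), req.full_url)
```

The prepared request can be sent with urllib.request.urlopen(req) once credentials have been added in whatever way the installation requires.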

See the Swagger documentation for further details.

Automatic reindexing of contexts

If the ACL of a context is updated in a way that changes the context's (and its entries') metadata readability (public/non-public via the _guest user), the context is automatically scheduled for reindexing. There is a built-in delay of 10 seconds before reindexing starts in case the client/user toggles the _guest readability within a short time frame.