Solr¶
Attention: The Solr index might experience delays and should not be used as the source of truth for critical operations.
Introduction and activation¶
Whenever an entry is created, modified or removed in EntryStore, events are created which are caught by listeners that update the Solr index accordingly. If needed, the Solr index can be recreated during the startup of EntryStore. The Solr-related parameters in the EntryStore configuration are:
entrystore.solr=on
entrystore.solr.reindex-on-startup=off
entrystore.solr.reindex-on-startup.wait=off # if on: blocks the startup until reindexing is complete
entrystore.solr.url=http://localhost:8080/solr/entrystore-core1
entrystore.solr.max-limit=200 (default: 100)
entrystore.solr.auth.username=<username when connecting via HTTP, leave empty if no authentication is necessary>
entrystore.solr.auth.password=<password when connecting via HTTP>
If the URL is a local path an embedded Solr-server (SolrJ) is used (Solr via HTTP is recommended by the Lucene team):
entrystore.solr.url=/srv/entrystore/solr/
Using custom Solr configurations¶
EntryStore comes with a default Solr configuration consisting of solrconfig.xml and schema.xml (both file names contain a trailing _default in the source repository) which the Solr Core is initialized with.
It is possible to override the locations of these configuration files, e.g. to provide a custom schema configuration. The files are loaded upon startup of EntryStore and copied to the local folder of the Solr Core. Keep in mind that a custom schema must also provide field definitions that match the indexed information (see further down).
entrystore.solr.schema.url=https://your-server.tld/schema.xml
entrystore.solr.config.url=https://your-server.tld/solrconfig.xml
The values must be URLs with the http, https, or file scheme. Redirects are not followed.
In general it is recommended to stick to EntryStore's default Solr configuration.
Configuration of Solr HTTP server¶
If a separate Solr server is set up for EntryStore, the supplied schema.xml file has to be used. For the rest (e.g. solrconfig.xml etc.) the example configuration in example/solr/collection1/conf in the Solr distribution (tar.gz) can be reused without modification.
Indexed information¶
The following pieces of information of an entry are indexed in Solr (see also schema.xml) and can therefore be used as search parameters (they are case-sensitive!):
title
: All titles (DC and DC terms) in all languages, including FOAF and VCard names.
title.lang
: To facilitate sorting by title, a dynamic field is created with the pattern title.[two-letter language code according to ISO 639-1]. There is no multi-value support because these fields are intended to be used in the context of sorting. Example: title.en for a title in English, title.nolang for a title without a language set.
description
: All descriptions in all languages.
tag.literal
: All literal tags in all languages. Covers common tag predicates such as dc:subject and dcterms:subject.
tag.literal.lang
: If a tag has a language set, a dynamic field is created with the pattern tag.literal.lang, e.g., tag.literal.en for all literal tags in English.
tag.uri
: All resource tags. Covers common predicates such as dc:subject and dcterms:subject.
metadata.predicate.literal.<md5>, metadata.predicate.uri.<md5>
: Used to carry out queries for exact predicate-object combinations. Due to restrictions in Solr it was necessary to shorten the predicate URI, which was done using an MD5 hash (hex) of the predicate URI truncated after 8 characters. Collisions are unlikely and, should they occur, non-fatal. The field value is the string value of the object. The field names are different for URI and literal objects due to different indexing strategies in Solr for the respective object type (literals are indexed using "text_ngram" and URIs are indexed using "string").
metadata.predicate.literal_s.<md5>
: Same as metadata.predicate.literal.<md5> (without _s) above, but indexed as "string" instead of "text_ngram".
metadata.predicate.literal_t.<md5>
: Same as metadata.predicate.literal.<md5> (without _t) above, but indexed as "text" instead of "text_ngram".
metadata.predicate.date.<md5>
: Same as metadata.predicate.literal.<md5> above, but indexed as "date" instead of "text", enabling date-specific queries such as range queries.
metadata.predicate.integer.<md5>
: Same as metadata.predicate.literal.<md5> above, but with special treatment for integer literals. This field is single-valued to allow for sorting and covers all numerical integer datatypes (including e.g. short, byte, long, etc.). Integers are indexed using Solr's "slong" type.
related.metadata.predicate.*.<md5>
: Same as metadata.predicate.*.<md5> above, but based on the related property index as described further down on this page.
metadata.object.literal
: All values of string literals (the datatype must not be specified), independent of language.
metadata.object.uri
: All object URIs.
lang
: The language of the resource, fetched from dc and dcterms:language. Used for searches.
all
: Catch-all Solr field, containing title, description, tags from above. This field is used if no search property is provided in the Solr query.
uri
: Entry URI.
resource
: Resource URI.
rdfType
: The RDF type of the resource, fetched from the entry and the metadata graph.
context
: Resource URI of the entry's surrounding context.
contextname
: Name of the context if the indexed entry is a context or system context. This field is not used for regular entries.
creator
: Creator URI.
contributors
: URIs of contributors.
lists
: Resource URIs of referring lists.
public
: true if the entry metadata is readable by the guest user. Warning: if a context's ACL is updated, the context's entries are not re-indexed. As a consequence, the public field may be outdated and incorrect if the context is not re-indexed manually after ACL changes. Do not use this field if you do not fully understand how it works.
created
: Creation date.
modified
: Modification date.
graphType
: The builtin type, case sensitive (i.e. Context, SystemContext, User, Group, List, ResultList, String, Graph, Pipeline, None).
entryType
: The location type, case sensitive (i.e. Local, Link, LinkReference, Reference).
resourceType
: The resource type, case sensitive (i.e. InformationResource, ResolvableInformationResource, NamedResource, Unknown).
acl.admin
: URIs of principals with admin rights (explicitly set in entry info).
acl.metadata.r
: URIs of principals with read rights on metadata (explicitly set in entry info).
acl.metadata.rw
: URIs of principals with read/write rights on metadata (explicitly set in entry info).
acl.resource.r
: URIs of principals with read rights on the resource (explicitly set in entry info).
acl.resource.rw
: URIs of principals with read/write rights on the resource (explicitly set in entry info).
profile
: URI of the entry's application profile. Originates from the entry graph and consists of the object value of the predicate http://entrystore.org/terms/profile (the deprecated predicate http://entryscape.com/terms/entityType is supported for backwards compatibility).
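The <md5> suffix of the metadata.predicate.* fields described above can be derived on the client side. The following is a minimal sketch, not EntryStore's own code: the function name predicate_field is hypothetical, and it assumes that the predicate URI is hashed as UTF-8 bytes and that the first 8 hex characters of the digest are kept.

```python
import hashlib


def predicate_field(predicate_uri: str, kind: str = "literal") -> str:
    """Build a field name such as metadata.predicate.literal.<md5>.

    kind selects the field variant (e.g. "literal", "literal_s", "uri", "date").
    Assumption: the URI is hashed as UTF-8 and the hex digest is cut to 8 characters.
    """
    digest = hashlib.md5(predicate_uri.encode("utf-8")).hexdigest()
    return f"metadata.predicate.{kind}.{digest[:8]}"


field = predicate_field("http://purl.org/dc/terms/subject", kind="uri")
print(field)
```

Comparing the printed name with the output of an entry's index inspection (described further down) is a simple way to check that the assumptions hold for a given installation.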
The following fields support incremental/partial matching through NGram (min=3, max=25) on the index and Edge NGram (min=3, max=25) on the query:
title
title.lang
description
tag.literal
tag.literal.lang
metadata.predicate.literal.*
related.metadata.predicate.literal.*
CAVEAT: all colons in URI parameters must be escaped with a backslash, i.e. "\:", otherwise Solr throws an exception.
RDF Format¶
By default, the Solr search returns all RDF graphs in RDF/JSON. This can be overridden by providing the URL parameter rdfFormat with the alternative value application/ld+json to serialize the RDF graphs using JSON-LD. Optimization takes place to detect the namespaces in use, which are used as JSON-LD contexts.
Solr Query Syntax¶
The default query field concatenator in EntryStore is OR. This can be overridden in every query, see examples below.
See the following links for more detailed information about the query parser syntax:
- https://lucene.apache.org/solr/guide/8_5/the-standard-query-parser.html
- https://lucene.apache.org/solr/guide/8_5/the-dismax-query-parser.html
- https://lucene.apache.org/core/8_5_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html
- https://cwiki.apache.org/confluence/display/solr/LocalParams
EntryStore uses the DisMax request handler.
The following characters are part of the Solr query syntax and need to be escaped (using \) inside the query parameter: \+-!():^[]"{}~*?|&;/
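Escaping can be done with a small helper before a value is placed into the query parameter. This is a sketch only: the function name escape_solr_value is hypothetical, and it simply prefixes every character from the list above with a backslash. It is meant for field values such as URIs, not for whole queries, where characters like : and ^ carry meaning.

```python
# Characters listed above as part of the Solr query syntax.
SOLR_SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_solr_value(value: str) -> str:
    """Prefix every Solr special character in value with a backslash."""
    return "".join("\\" + ch if ch in SOLR_SPECIAL_CHARS else ch for ch in value)


escaped = escape_solr_value("http://purl.org/dc/terms/subject")
print("resource:" + escaped)
```

The escaped value still has to be URL encoded before it is sent as part of a request.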
Sorting¶
The parameter sort can be used for Solr-style sorting, e.g. sort=title.en+asc,modified+desc. By default, results are sorted by score (relevancy) and then by modified (last modification date). All text_sort and single-value fields can be used for sorting; this basically excludes title, description and keywords, but allows sorting by e.g. title.en. title.* is of type text_ngram, but when used as a sorting property it is internally rewritten into title_sort.*, which is of type text_sort and ignores case.
Default language¶
This is mostly relevant for instances that are used with a language that is not natively supported by the client application (e.g. the client supports lang1 and lang2, but the user mostly provides metadata in lang3). In such cases it is possible to configure a default sorting language, which results in literals that match the default language being indexed as title.default and title_sort.default, respectively.
Configuration:
entrystore.solr.default-sorting-lang=<language code> (default: no default language set)
Boosting¶
In order to increase the relevance (boost) of certain fields it is possible to use the caret symbol ^, which is regular Solr syntax. To boost the title field with factor 10, use the following construct: title:someterm^10. When provided at query time, boosting applies to individual terms, not to fields. Boosting at index time is possible, but requires modifications of the Solr configuration.
By default the fields used for sorting are score and modified, so in order to get a search result fully based on relevance the default sort setting has to be overridden by providing score as the only sorting parameter: sort=score+desc.
Example¶
http://base/search?type=solr&query=title:organic^50+OR+title:farming^25+tag.literal:organic^10&sort=score+desc
The ^ needs to be URL encoded like the rest of the query. The example above does not encode special characters and URIs for better readability.
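A properly encoded version of such a request can be produced with a standard URL library. The sketch below uses Python's urllib.parse.urlencode; the helper name build_search_url is hypothetical and http://base is the same placeholder as in the example above.

```python
from urllib.parse import urlencode

BASE_URL = "http://base/search"  # placeholder, replace with the real repository URL


def build_search_url(query: str, **params: str) -> str:
    """Return a search URL with all parameter values URL encoded."""
    all_params = {"type": "solr", "query": query, **params}
    # urlencode turns spaces into "+", ":" into "%3A" and "^" into "%5E".
    return BASE_URL + "?" + urlencode(all_params)


url = build_search_url(
    "title:organic^50 OR title:farming^25 tag.literal:organic^10",
    sort="score desc",
)
print(url)
```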
Accuracy of result count¶
Every search result also contains an estimated result count (e.g. to be used with pagination). This number may not reflect the actual number of results which the requesting user has access to. In such cases the result count is always higher than the number of accessible entries.
The reason for this lies in how ACL management is done. The ACL is contained in the triple store, but for performance reasons it cannot be included in the Solr-index. The search is run against the Solr-index which also returns the result count (this is the same number as returned by the EntryStore search resource in the REST API), but the ACL of the matching entries is not looked up before the entries and their metadata are returned in the partial result set (because of pagination).
Pagination¶
The response size is limited to 50 by default and can be set to any value between 1 and 100 with the URL parameter limit. If the result count is greater than the limit, the next result batch can be fetched using the offset parameter.
The following URL parameters are relevant for pagination:
limit
: Default 50, can be set to any integer from 1 to 100. The default max limit is 100, which can be overridden using the setting entrystore.solr.max-limit, see above.
offset
: Default 0, determines where the returned result set starts.
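The interplay of limit and offset can be illustrated by computing the parameters for each batch from a result count. This is a sketch with a hypothetical helper name (page_params); it assumes the default maximum limit of 100 and does not perform any requests.

```python
from typing import Dict, Iterator


def page_params(result_count: int, limit: int = 50) -> Iterator[Dict[str, int]]:
    """Yield the limit/offset pair for every batch needed to cover result_count."""
    if not 1 <= limit <= 100:
        # 100 is the default maximum; it may differ if entrystore.solr.max-limit is set.
        raise ValueError("limit must be between 1 and 100")
    for offset in range(0, result_count, limit):
        yield {"limit": limit, "offset": offset}


pages = list(page_params(120, limit=50))
print(pages)
```

Since the result count is an estimate that may be higher than the number of accessible entries, the last batches can contain fewer entries than expected.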
Examples¶
http://base/search?type=solr&query=title:organic+AND+builtinType:List&sort=score+desc,modified+desc
http://base/search?type=solr&query=organic+AND+lang:de&sort=title.de+desc,modified+desc
http://base/search?type=solr&query=organic+OR+ecology+lang:en (this would yield the same result also without OR)
http://base/search?type=solr&query=organic+lang:en&limit=10&offset=20
http://base/search?type=solr&query=title:*&limit=10&rdfFormat=application/ld+json
Syndication feed¶
Returning data in the format of an Atom or RSS feed can be enabled by using the parameter syndication, defining the format you want to use. The following values are supported:
atom_1.0
atom_0.3
rss_2.0
rss_1.0
You can also use the lang parameter to define the preferred language of the entries to return. The value should be a two-character code following the ISO 639-1 specification ("se", "de", etc.). If omitted, or if an entry in the specified language does not exist, the language will be undetermined.
The following parameters are not supported while using syndication:
sort - The stream is always sorted in descending order according to when the entries were posted
offset - Pagination is not supported when using syndication
Faceted search¶
Parameters¶
facetFields
: A comma-separated list of fields that should be treated as facets.
filterQuery
: A query to filter the results of a faceted search, e.g. facetX:valueY.
facetMinCount
: Result threshold for facets to be included in the query result. Must be an integer value, default is 1.
facetLimit
: Maximum number of facets to be returned. Must be an integer value, default is 100. Values up to 1000 are allowed. Higher limits can be allowed by setting entrystore.solr.facet-max-limit in entrystore.properties.
facetMatches
: Only return facets that match a regular expression.
facetMissing
: If set to true, a count of all results that match the query but have no facet value for the field is computed and returned in the response. Default is false.
Fields¶
Fields that are indexed fully and without modification (i.e., field type "string" in Solr) may be used as facets; in particular the following fields are intended to be used as facets:
metadata.predicate.date.<md5>
metadata.predicate.integer.<md5>
metadata.predicate.literal.<md5> (this field is of type "text" which EntryStore automatically translates into metadata.predicate.literal_s.<md5> for faceted queries)
metadata.predicate.literal_s.<md5>
metadata.predicate.uri.<md5>
- All fields containing a URI
- All fields named with pattern *Type
Examples¶
http://base/search?type=solr&query=organic&facetFields=metadata.predicate.literal.256bd150&filterQuery=metadata.predicate.literal.256bd150:crop
: Use dc:subject as facet and value "crop" as filter.
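A faceted request like the one above can be assembled from the parameters described in this section. The sketch below is illustrative only: the helper name facet_search_url is hypothetical, the field name is taken from the example, and the upper bound of 1000 assumes that entrystore.solr.facet-max-limit has not been changed.

```python
from typing import Optional
from urllib.parse import urlencode


def facet_search_url(base: str, query: str, facet_field: str,
                     facet_value: Optional[str] = None, facet_limit: int = 100) -> str:
    """Build a faceted search URL, optionally filtered to a single facet value."""
    if not 1 <= facet_limit <= 1000:
        raise ValueError("facet_limit must be between 1 and 1000")
    params = {
        "type": "solr",
        "query": query,
        "facetFields": facet_field,
        "facetLimit": facet_limit,
    }
    if facet_value is not None:
        # The filter uses the field:value form shown in the example above.
        params["filterQuery"] = f"{facet_field}:{facet_value}"
    return f"{base}/search?{urlencode(params)}"


facet_url = facet_search_url(
    "http://base", "organic", "metadata.predicate.literal.256bd150", "crop"
)
print(facet_url)
```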
Related property index¶
The related property index contains properties from other entries that the entry's metadata links to. All relations that are to be followed must be configured using the following settings:
entrystore.solr.related=on|off (default: off)
entrystore.solr.related.properties.n=<predicate URI>
By default only entries in the same context are indexed. This can be changed per property by appending the "global" parameter to the property configuration. The URI and the parameter need to be separated by a comma:
entrystore.solr.related.properties.n=<predicate URI>,global
Example¶
entrystore.solr.related=on
entrystore.solr.related.properties.1=http://purl.org/dc/terms/creator,global
entrystore.solr.related.properties.2=http://schema.org/image
entrystore.solr.related.properties.3=http://schema.org/location
entrystore.solr.related.properties.4=http://schema.org/address
Inspecting an entry's index¶
The owner of an entry can inspect the indexed information of an entry by accessing its index URL, which is constructed by appending /index to the end of the entry URI.
Example: https://base/1/entry/1/index
The response body contains a JSON object with all indexed fields and their values.
Rebuilding the index¶
The index can be rebuilt upon startup (recommended) through a property in the EntryStore settings (see above). Alternatively, admin users (i.e. admin and members of the admin group) and context owners have the possibility to rebuild the Solr index or specific contexts live without restart through the management API using the following request:
POST http://base/management/solr
Content-Type: application/json
{
"command": "reindex",
"context": "https://base/_contexts/entry/context-id"
}
If the context parameter is omitted, the whole repository will be reindexed.
See the Swagger documentation for further details.
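The reindex request shown above can also be prepared programmatically. The following sketch only builds the request object with Python's standard library and does not send it; the helper name build_reindex_request is hypothetical, and authentication (required for admin users or context owners) is not covered here.

```python
import json
import urllib.request
from typing import Optional


def build_reindex_request(base_url: str, context_uri: Optional[str] = None) -> urllib.request.Request:
    """Prepare a POST to the management API that triggers reindexing.

    Without context_uri the whole repository is reindexed.
    """
    payload = {"command": "reindex"}
    if context_uri is not None:
        payload["context"] = context_uri
    return urllib.request.Request(
        f"{base_url}/management/solr",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reindex_request("http://base", "https://base/_contexts/entry/context-id")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
# Sending it would be done with urllib.request.urlopen(request) in an authenticated session.
```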
Automatic reindexing of contexts¶
If the ACL of a context is updated in a way that changes the context's (and its entries') metadata readability (public/non-public via the _guest user), the context is automatically scheduled for reindexing. There is a built-in delay of 10 seconds before reindexing starts in case the client/user toggles the _guest readability within a short time frame.