Custom facet querying an RDF graph projected by TDE: good practice?

tde
semantics
xquery

#1

Hi all,

I hope this is the right place for discussions about MarkLogic, as it seems the mailing list doesn’t exist anymore. Stack Overflow is not really suited to my topic, which is not a specific question but rather a request for advice and feedback on good practice with ML.

I’m new to ML and have been learning and playing with it for a few weeks (#covid19containment!).
We have an application which uses MongoDB/Solr and we would like to switch to MarkLogic.
To help analyse this migration, I made a sample XQuery search application with ML storing our XML data.

My first impression was some confusion: it seems impossible to build a specific index by manipulating the XML to generate new values (concatenation, split, anything one can do with XPath). I used to do this with SDX, which can generate a new index with an XSLT (cf. this tuto, sorry, it’s in French). Aggregation and path range indexes can help with this, but I found them really limited. Of course one could compute new XML elements at ingestion time (using an envelope pattern for instance), but I prefer not to “duplicate” information, as far as possible!

So the tricky problem I had to face was a facet whose data must be fed not from the searched documents but from other documents linked to them:
The data is the “publishing number”: it does not appear in the article document itself but in the journal document(s) where the article has been published.
And we want to query articles by publishing number, yes we do :)

How to solve this?

The solution I tried is to use a TDE that projects the relations between article and journal as triples.
Then I wrote a custom facet that uses a SPARQL query to get the right URIs.

The idea comes from the semantics FAQ what-are-the-implications-of-faceted-search, but I couldn’t find any example of such a facet.
The most similar code I found was in semantic-news-search, but there the triples are not stored in the RDF graph; they are stored in the property document, which is quite different (no SPARQL there), I guess.

Please see my facet module code below. It was not easy to achieve, but I finally got it to work today!

My questions are mostly about your feedback on this kind of custom constraint:

  • Have you ever had to write such a custom facet that queries the RDF graph while searching documents?
  • Is it good practice to do so?
  • Are there other ways to achieve this?
  • Isn’t the Optic API more appropriate for such things? (I couldn’t figure out how to use it for my purpose)
  • Is there anything in the code that is really bad, or that will cause problems in production?
    (I’m aware the regex filter might get too big when there are a lot of documents; I guess I’ll use a cts filter on my SPARQL expression)
  • Any other feedback is welcome!
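
A minimal sketch of the "cts filter" idea from the question about the regex, under stated assumptions: that a store built with sem:store((), $query) only exposes triples from documents matching $query, that this also holds for TDE-projected triples, and that the xf:parent triples are projected from the article documents. The word query below is a hypothetical stand-in for the $query argument that start-facet receives. Because the numeroOrdre triples live in the journal documents, which the article search does not match, they are looked up in a second, unrestricted query.

```xquery
xquery version "1.0-ml";

import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Hypothetical stand-in for the $query argument passed to start-facet. :)
let $query as cts:query := cts:word-query("example")

(: Step 1: restrict the store to the current search, so that only the
   article -> journal links of matching articles are returned.
   This replaces the FILTER regex built from the list of URIs. :)
let $parents :=
  sem:sparql('
    PREFIX xf: <http://www.lefebvre-sarrut.eu/ns/xmlfirst#>
    SELECT ?childEE ?parentEE
    WHERE { ?childEE xf:parent ?parentEE }',
    (), (), sem:store((), $query))
  ! map:get(., "parentEE")

(: Step 2: unrestricted lookup of the publishing number of each parent.
   The parent is passed as a SPARQL binding, not concatenated into the query text. :)
let $values :=
  for $parent in $parents
  return
    sem:sparql('
      PREFIX xf: <http://www.lefebvre-sarrut.eu/ns/xmlfirst#>
      SELECT ?numeroOrdre
      WHERE { ?parentEE xf:META_EFL_META_numeroOrdre ?numeroOrdre }',
      map:entry("parentEE", $parent))
    ! map:get(., "numeroOrdre")

(: One <value> per distinct publishing number, counted per matching article. :)
for $value in distinct-values($values)
return <value name="{$value}" count="{count($values[. = $value])}"/>
```

This runs one small SPARQL query per matching article, so it is only meant to show where the cts:query can be plugged in, not as a tuned implementation.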

I guess if this approach looks acceptable I’ll try to generalize the principle (just like in semantic-news-search) so I can query any triple.
This sounds like a good way to generate new index entries (not real indexes like the ML ones, but useful for searching).

Thanks in advance,
Matthieu

My custom facet code:

xquery version "1.0-ml";

module namespace xf = "http://www.lefebvre-sarrut.eu/ns/xmlfirst";

import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

declare namespace search = "http://marklogic.com/appservices/search";
declare namespace cts = "http://marklogic.com/cts";
declare namespace sparql-results = "http://www.w3.org/2005/sparql-results#";

(:======================================================:)
(:CUSTOM FACET:)
(:======================================================:)

(:Documentation:
  This custom facet exposes the publishing number (numeroOrdre) of an article (childEE) while searching among articles only.
  This publishing number does not appear in the article document itself but in the journal (parentEE) document(s)
  where the article has been published.
  A TDE has been set up to project XML data from articles and journals to a graph of RDF triples, especially:
  - the relation between article and journal : xf:parent
  - the publishing number of each journal : xf:META_EFL_META_numeroOrdre
  This custom facet uses the RDF graph to resolve the relation between article and publishing number
  
  To use this facet, set this constraint in your search:search options :
  <constraint name="NumeroOrdreInParent">
    <custom facet="true">
     <parse apply="parse-numeroOrdre" ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst" at="/modules/xf-facet.mod.xqy"/>
     <start-facet apply="start-facet-numeroOrdre" ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst" at="/modules/xf-facet.mod.xqy"/>
     <finish-facet apply="finish-facet-numeroOrdre" ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst" at="/modules/xf-facet.mod.xqy"/>
    </custom>
  </constraint>:)


(:Documentation: 
  xf:parse-numeroOrdre makes facetName:value usable in the query string,
  for example NumeroOrdreInParent:12 in this case.
  How does it work?
  1) query the RDF graph with SPARQL to get all URIs of childEE whose parentEE has the requested numeroOrdre
  2) return a cts:document-query with those URIs
  This query will be added to the search:search query which uses this facet in its options:)
declare function xf:parse-numeroOrdre(
  $constraint-qtext as xs:string, 
  $right as schema-element(cts:query)) 
as schema-element(cts:query)
{
  
  let $s as xs:string := string($right//cts:text/text())
  let $sparqlQuery as xs:string := 
    <myQuery>
      PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#>
      PREFIX xf: &lt;http://www.lefebvre-sarrut.eu/ns/xmlfirst#>
      
      SELECT ?childEE ?uri
      WHERE {{
        ?childEE rdf:type xf:EditorialEntity .
        ?parentEE rdf:type xf:EditorialEntity .
        ?childEE xf:parent ?parentEE .
        ?parentEE xf:META_EFL_META_numeroOrdre "{$s}" .
        ?childEE xf:doc-uri ?uri .
      }}
    </myQuery>/text()
  
  let $triples as item()* := sem:sparql($sparqlQuery)
  let $triples-xml as element(sparql-results:sparql) := sem:query-results-serialize($triples, "xml")
  (:let $_ := xdmp:log($triples-xml):)
  let $uris as xs:string* := $triples-xml//sparql-results:binding[@name='uri']/string(.)
  
  return
    (: add qtextconst attribute so that search:unparse will work - required for some search library functions :)
    (:see http://blog.davidcassel.net/2011/08/unparsing-a-custom-facet for more explanations:)
    <cts:document-query qtextconst="{concat($constraint-qtext, string($right//cts:text))}">
      {
        for $uri in $uris return 
          <cts:uri>{$uri}</cts:uri>
      }
    </cts:document-query>
    
};

(:Documentation: 
  xf:start-facet-numeroOrdre generates the values of the facet; it is completed by xf:finish-facet-numeroOrdre,
  which produces the expected format. Having these two functions is useful for optimisation reasons.
  How does it work?
  1) Get all document URIs of the result of the current search
  2) Query the RDF graph to get all numeroOrdre of parentEE which have childEE attached
  3) Filter the graph result on the URIs found at step 1
  4) Get all distinct values of the resulting numeroOrdre
:)
declare function xf:start-facet-numeroOrdre(
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*)
as item()*
{

let $currentSearchUris as xs:string* := 
  for $uri in cts:uris((), ($facet-options, "concurrent"), $query, $quality-weight, $forests)
  return string($uri)

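(: Anchor the whole alternation, i.e. '^(a | b)$'; the spaces are ignored thanks to the "x" flag of the FILTER.
   Note: URIs are inserted as-is, so regex metacharacters such as '.' are not escaped. :)
let $currentSearchUrisRegex as xs:string := concat('^(', string-join($currentSearchUris, ' | '), ')$')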
  
let $sparqlQuery as xs:string := 
  <myQuery>
    PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xf: &lt;http://www.lefebvre-sarrut.eu/ns/xmlfirst#>
    
    SELECT ?numeroOrdre
    WHERE {{
      ?childEE rdf:type xf:EditorialEntity .
      ?parentEE rdf:type xf:EditorialEntity .
      ?childEE xf:parent ?parentEE .
      ?childEE xf:doc-uri ?childEEURI .
      ?parentEE xf:META_EFL_META_numeroOrdre ?numeroOrdre .
      # Contextualize the query to the current search
      FILTER (regex (?childEEURI, "{$currentSearchUrisRegex}", "x"))
    }}
  </myQuery>/text()
  (:let $_ := xdmp:log($sparqlQuery):)
  
  let $triples as item()* := sem:sparql($sparqlQuery)
  let $triples-xml as element(sparql-results:sparql) := sem:query-results-serialize($triples, "xml")

  for $numeroOrdre in distinct-values($triples-xml//sparql-results:binding[@name='numeroOrdre']/string(.)) 
  return 
    <value name="{$numeroOrdre}" 
           count="{count($triples-xml//sparql-results:binding[@name='numeroOrdre'][. = $numeroOrdre])}"/>
};

(:Documentation:
  xf:finish-facet-numeroOrdre gets the result of start-facet-numeroOrdre in the $start argument.
  It only formats it as expected by the API:)
declare function xf:finish-facet-numeroOrdre(
  $start as item()*,
  $constraint as element(search:constraint), 
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?, 
  $forests as xs:unsignedLong*)
as element(search:facet)
{

  <search:facet name="{$constraint/@name}">
  {
    for $val in $start
    return
      <search:facet-value name="{$val/@name}" count="{$val/@count}">
        {string($val/@name)}
      </search:facet-value>
  }
  </search:facet>
};
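
For reference, a minimal sketch of how a module like this could be called through the Search API. It assumes the module is deployed at /modules/xf-facet.mod.xqy as in the comment above; the value 12 is the example value from the documentation comment, not real data.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Search options declaring the custom constraint; the apply attributes
   must match the local names of the functions declared in the module. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="NumeroOrdreInParent">
      <custom facet="true">
        <parse apply="parse-numeroOrdre"
               ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst"
               at="/modules/xf-facet.mod.xqy"/>
        <start-facet apply="start-facet-numeroOrdre"
                     ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst"
                     at="/modules/xf-facet.mod.xqy"/>
        <finish-facet apply="finish-facet-numeroOrdre"
                      ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst"
                      at="/modules/xf-facet.mod.xqy"/>
      </custom>
    </constraint>
  </options>

(: The constraint name is used as a prefix in the query text. :)
return search:search("NumeroOrdreInParent:12", $options)
```

The response should contain the matching articles plus a search:facet element named NumeroOrdreInParent, built by the start-facet and finish-facet functions.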

#2

This is another version using cts:triples instead of SPARQL:

  • Selecting triples is a bit less easy than with SPARQL

  • cts:triples can take the query as its 6th argument, which is a much better way of filtering than the SPARQL FILTER with a regex!

      xquery version "1.0-ml";
      
      module namespace xf = "http://www.lefebvre-sarrut.eu/ns/xmlfirst";
      
      (:======================================================:)
      (:CUSTOM FACET:)
      (:======================================================:)
      
      (:Documentation:
      This custom facet exposes the publishing number (numeroOrdre) of an article (childEE) while searching among articles only.
      This publishing number does not appear in the article document itself but in the journal (parentEE) document(s)
      where the article has been published.
      A TDE has been set up to project XML data from articles and journals to a graph of RDF triples, especially:
      - the relation between article and journal : xf:parent
      - the publishing number of each journal : xf:META_EFL_META_numeroOrdre
      This custom facet uses the RDF graph to resolve the relation between article and publishing number
      
      To use this facet, set this constraint in your search:search options :
      <constraint name="NumeroOrdreDansRevueCtsTriples">
        <custom facet="true">
          <parse apply="parse-numeroOrdre" ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst" at="/modules/xf-facet-cts-triples_numeroOrdre.mod.xqy"/>
          <start-facet apply="start-facet-numeroOrdre" ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst" at="/modules/xf-facet-cts-triples_numeroOrdre.mod.xqy"/>
          <finish-facet apply="finish-facet-numeroOrdre" ns="http://www.lefebvre-sarrut.eu/ns/xmlfirst" at="/modules/xf-facet-cts-triples_numeroOrdre.mod.xqy"/>
        </custom>
      </constraint>
      :)
      
      declare namespace search = "http://marklogic.com/appservices/search";
      
      declare variable $rdf := 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
      declare variable $rdfs := 'http://www.w3.org/2000/01/rdf-schema#';
      declare variable $xf := 'http://www.lefebvre-sarrut.eu/ns/xmlfirst#';
      
      
      (:Documentation: 
      xf:parse-numeroOrdre makes facetName:value usable in the query string,
      for example NumeroOrdreDansRevueCtsTriples:12 in this case.
      How does it work?
      1) query the RDF graph with cts:triples to get all URIs of docs whose parent doc has the requested numeroOrdre
      2) return a cts:document-query with those URIs
      This query will be added to the search:search query which uses this facet in its options
      :)
    
      declare function xf:parse-numeroOrdre(
      $constraint-qtext as xs:string, 
      $right as schema-element(cts:query))
      as schema-element(cts:query)
      {
      let $s as xs:string := string($right//cts:text/text())
      
      (:Find every doc whose meta numeroOrdre has the requested value, then get the URIs of its children:)
      let $uris as xs:anyURI* :=
          cts:triples( ((:parentDoc:)) , sem:iri($xf||'META_EFL_META_numeroOrdre') , ($s))
          ! cts:triples( ((:doc:)), sem:iri($xf||'parent'), ((:parentDoc:)sem:triple-subject(.)) )
          ! cts:triples( ((:doc:)sem:triple-subject(.)), sem:iri($xf||'doc-uri'), ((:docUri:)) )
          ! sem:triple-object(.)
       
      return
        (: add qtextconst attribute so that search:unparse will work - required for some search library functions :)
        (:see http://blog.davidcassel.net/2011/08/unparsing-a-custom-facet for more explanations:)
        <cts:document-query qtextconst="{concat($constraint-qtext, string($right//cts:text))}">
          {
            for $uri in $uris return 
              <cts:uri>{$uri}</cts:uri>
          }
        </cts:document-query>
        
      };
      
      (:Documentation: 
      xf:start-facet-numeroOrdre generates the values of the facet; it is completed by xf:finish-facet-numeroOrdre,
      which produces the expected format. Having these two functions is useful for optimisation reasons.
      :)
    
      declare function xf:start-facet-numeroOrdre(
      $constraint as element(search:constraint),
      $query as cts:query?,
      $facet-options as xs:string*,
      $quality-weight as xs:double?,
      $forests as xs:unsignedLong*)
      as item()*
      {
      
       let $values as xs:string* :=
          (:get all documents of the current query - choose doc-uri as predicate so we have one triple per doc:)
          cts:triples( ((:doc:)), (sem:iri($xf||'doc-uri')), (),'=', (), $query )
          (:get parent documents of these documents:) 
          ! cts:triples((:doc:)sem:triple-subject(.), (sem:iri($xf||'parent')), ((:parentDoc:)) )
          (:get numeroOrdre of these parent documents:) 
          ! cts:triples((:parentDoc:)sem:triple-object(.), (sem:iri($xf||'META_EFL_META_numeroOrdre')), ((:numeroOrdre:)) )
          ! sem:triple-object(.)
          
       for $val in distinct-values($values) 
        return 
        <value name="{$val}" count="{count($values[. = $val])}"/>
      };
      
      (:Documentation:
      xf:finish-facet-numeroOrdre gets the result of start-facet-numeroOrdre in the $start argument.
      It only formats it as expected by the API
      :)
      
      declare function xf:finish-facet-numeroOrdre(
      $start as item()*,
      $constraint as element(search:constraint), 
      $query as cts:query?,
      $facet-options as xs:string*,
      $quality-weight as xs:double?, 
      $forests as xs:unsignedLong*)
      as element(search:facet)
      {
      
      <search:facet name="{$constraint/@name}">
      {
        for $val in $start
        return
          <search:facet-value name="{$val/@name}" count="{$val/@count}">
            {string($val/@name)}
          </search:facet-value>
      }
      </search:facet>
      };