Design options and decisions: Technical Implementation of Semantic representation

Goal: decide on implementation

How will we implement tags?
- Option 1: As plain text that is appended to slide revisions and decks (e.g. tags: [mathematics, fibonacci, ...])
  - Pros: easy to implement
  - Cons: enriching the tags with data from the WWW might become difficult, as we have no place to save additional information (Roy Meissner)
- Option 2: as URIs that are already linked to further knowledge (RDF)?
  - Pros: we might be able to infer knowledge from the WWW and to present enriched recommendations to users
  - Cons: we might not be able to link to distinct topics/concepts in case we are unsure about the meaning of a tag.
- Option 3: as both URIs and plain-text tags (Mariano Rico), storing each tag as a name with an optional URL, like this: '{ name: Scientific, url: http://dbpedia.org/ontology/Scientific }' (Aleksandr Korovin). (Option 3 added by Klaas Andries de Graaf for discussion overview purposes; check whether I misunderstood the arguments/pros/cons.)
  - Pros:
    - The language label can be retrieved from DBpedia via the URI (Mariano Rico, see SWIK-903).
    - Users can still specify a tag without a URL (Aleksandr Korovin); the link is optional for a tag and links are not requested from users (Roy Meissner).
    - Users are able to add plain-text tags that we might enrich later; URLs for tags are added later by our automatic system (e.g. a Named Entity Recognition that is able to disambiguate) (Roy Meissner).
  - Cons: not good for keyword searching in MongoDB, where the typical data model for keyword search is a plain array of keywords (https://docs.mongodb.com/manual/tutorial/model-data-for-keyword-search/). We need to test a query that searches the tags array by name: tags: [{name: 'name1', ...}, {name: 'name2'}, ...] (Aleksandr Korovin) → also in Solr? (Serafeim Chatzopoulos)
  - Possible complication: the disambiguation problem for the tag → URI link (Roy Meissner). We need a custom mapping for tags with some meta information, e.g. slide202 → {tag1 → {sameAs: <url>, algorithm: NamedEntityRecognition, manuallyAdded: false, ...}, tag2 → ...} (Roy Meissner)
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
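To make Option 3 more concrete, here is a minimal sketch of a tag that always has a plain-text name and only optionally a URL plus some provenance metadata. The `Tag` class, the `to_document` and `slides_with_tag` helpers, and the exact field names are assumptions for illustration only (the discussion above uses both `url` and `sameAs` for the link), not the actual slide model. The lookup emulates in plain Python the "search the tags array by name" query that still needs testing; in MongoDB the corresponding filter on an array of embedded documents would use dot notation, i.e. `{"tags.name": "fibonacci"}`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tag:
    """Hypothetical tag model: plain-text name, optional link and provenance."""
    name: str
    url: Optional[str] = None        # e.g. a DBpedia URI; may be filled in later
    algorithm: Optional[str] = None  # assumed field: which annotator added the URL
    manually_added: bool = True      # assumed field: typed in by a user?

    def to_document(self) -> dict:
        """Serialize to a plain dict, leaving out optional fields that are unset."""
        doc = {"name": self.name, "manuallyAdded": self.manually_added}
        if self.url is not None:
            doc["url"] = self.url
        if self.algorithm is not None:
            doc["algorithm"] = self.algorithm
        return doc


def slides_with_tag(slides: list, tag_name: str) -> list:
    """Return the ids of slides whose tags array contains a tag with this exact name."""
    return [
        slide["id"]
        for slide in slides
        if any(tag["name"] == tag_name for tag in slide.get("tags", []))
    ]


# Example data: one enriched tag, two plain-text tags.
slides = [
    {"id": "slide202", "tags": [
        Tag("Scientific", url="http://dbpedia.org/ontology/Scientific",
            algorithm="NamedEntityRecognition", manually_added=False).to_document(),
        Tag("fibonacci").to_document(),
    ]},
    {"id": "slide203", "tags": [Tag("mathematics").to_document()]},
]

print(slides_with_tag(slides, "fibonacci"))
```

Because the URL is optional, a tag added by a user as plain text and a tag enriched later by an automatic annotator share the same shape, which keeps the by-name search identical for both.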
How do we save these tags?
1. DECISION: save tags as part of the current MongoDB database, directly on slides/decks (either as JSON-LD or plain JSON). Passed with 5 votes (Roy, Klaas, Alex, Luis, Paul), i.e. more than 50% of those involved
- Option 1 - As part of the slide model in MongoDB
  - Pros: easy to add to our current model (Roy Meissner), same technology stack. We can use Solr to maintain a (however materialized) index for tags, so that we are able to search them efficiently (Roy Meissner)
  - Cons: possibly complex (and long-running) queries to find things that are related to tags; more complex to add information from the WWW. For reasoning, inferring knowledge, and semantic search we will need an RDF store with a SPARQL endpoint that works in real time for users, right? (T2.4) (Aleksandr Korovin; Klaas Andries de Graaf agrees). Needs mapping (does this also hold for option 1.1? See point "additions/complications")
- Option 1.1 - JSON-LD (Allan Third) (Option 1.1 added by Klaas Andries de Graaf for discussion overview purposes; check whether I misunderstood the arguments/pros/cons)
  - Pros: there's no separation between the Mongo representation and the RDF. Can help with SEO if embedded in pages (Ali Khalili). Graph-db supports JSON-LD (Aleksandr Korovin)
  - Cons: shouldn't be such a disruptive change; it should just involve adding some fields to the JSON that's there (Allan Third). Add an @context field with context.json and that is all (Aleksandr Korovin). May need an additional mapper to a triple DB for fast SPARQL queries (see additions/complications below).
    The discussion was originally about how we store the tags in MongoDB, and whether we need JSON-LD to store them. As Roy said, after a longer discussion here we came to the conclusion that, for the storage decision alone, plain JSON is sufficient for now. Storing the tags as JSON-LD requires design decisions about a possible later RDF graph. We concluded that we can store them as plain JSON, and that as soon as someone wants to create an RDF graph, all that is needed is a good ETL-to-RDF process. So there is no need to discuss the details of JSON-LD while we don't yet know what this potential RDF graph should look like. (Antje Schlaf)
  - Additions/complications:
    - Needs a SPARQL mapper for querying (Aleksandr Korovin, Roy Meissner).
    - The on-the-fly conversion of a SPARQL request to a MongoDB request can be too slow for complex SPARQL queries. I vote for offline RDF generation by means of mappers. (Mariano Rico)
    - For performance reasons I am also more inclined to use a triple store rather than on-the-fly query rewriting. Go for both approaches, do a benchmark and then decide. (Ali Khalili)
    - Transfer (ETL) the database content (or just the changes since last time) to an actual graph store from time to time (Roy Meissner).
- Option 2 - as a separate DB (e.g. a Graph DB - read next section)
  - Pros: simple queries, leverage default Semantic Web technologies (like interlinking/inferring knowledge on the WWW), no schema boundaries
  - Cons: possibly worse performance, another technology stack. Needs synchronising (Roy Meissner). How to search both GraphDB and MongoDB?
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
- Related:
  - RML - http://rml.io/
  - Karma - https://usc-isi-i2.github.io/karma/
  - SparqlMap - https://github.com/tomatophantastico/sparqlmap/
  - morph-xr2rml - https://github.com/frmichel/morph-xr2rml
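The periodic ETL idea from the discussion (transfer the MongoDB content, or just the changes since the last run, to a graph store) could look roughly like the sketch below, which turns plain-JSON tags into N-Triples lines. It assumes the { name, url } tag shape from Option 3 above. The base IRI, the `hasTag` predicate, and the `updated` timestamp field are hypothetical placeholders, not part of the current model, and a real pipeline would more likely build on one of the mapping tools listed under Related.

```python
# Hypothetical IRIs; the real vocabulary would have to be decided first.
SLIDE_BASE = "http://example.org/slide/"
HAS_TAG = "http://example.org/vocab/hasTag"


def escape_literal(text: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def slide_to_ntriples(slide: dict) -> list:
    """Emit one triple per tag: an IRI object if the tag is linked, else a literal."""
    subject = "<" + SLIDE_BASE + slide["id"] + ">"
    lines = []
    for tag in slide.get("tags", []):
        if tag.get("url"):
            obj = "<" + tag["url"] + ">"
        else:
            obj = '"' + escape_literal(tag["name"]) + '"'
        lines.append(f"{subject} <{HAS_TAG}> {obj} .")
    return lines


def export_changed(slides: list, since: int = 0) -> list:
    """Export only slides modified after `since` (assumed epoch seconds)."""
    lines = []
    for slide in slides:
        if slide.get("updated", 0) > since:
            lines.extend(slide_to_ntriples(slide))
    return lines


slides = [
    {"id": "slide202", "updated": 1700000000, "tags": [
        {"name": "Scientific", "url": "http://dbpedia.org/ontology/Scientific"},
        {"name": "fibonacci"},  # not yet linked: exported as a plain literal
    ]},
]

print("\n".join(export_changed(slides)))
```

Keeping the export as a separate offline step means the slide model itself stays plain JSON, in line with the decision above, while the shape of the RDF graph can still be changed later by adjusting only the mapper.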
How do we realize semi-automatic semantic annotation of decks and slides?

- Option 0 - (basic) annotator tab with a list of annotations per deck and slide, similar to the sources tab (Klaas Andries de Graaf)
  - Pros: allows users to easily add/administer/change the annotations manually. Can be extended with automatic features later on.
  - Cons: is not automatic, so it needs to be extended. Users need to manually add a tag name and a URI for linking to e.g. a DBpedia concept.
- Option 1 - Use the LD-R semi-automatic annotation feature (API?) (Ali Khalili)
  - Pros:
  - Cons:
- Option 2 - Implement our own function/service?
  - Pros:
  - Cons: Possibly takes more time? Reuse existing libraries?
- Option 3 - use tf-idf as a general measure to get important terms for deck/slides (Antje Schlaf)
  - Pros:
    - get important terms independently of recognizing DBpedia (or other) entities; should be used in combination (get important terms, link to known entities when possible)
    - terms are considered special / important when they are frequent in the current deck but not commonly frequent in other documents (e.g. "RDF" was often used in this deck but is not a term often used in decks in general)
  - Cons:
    - possibly a bit general, but for the bigger languages we can use POS taggers etc. to restrict the terms to nouns if wanted
    - the document frequency of terms needs to be recalculated / updated and stored so that it is accessible for the calculation
- Option 4 - use machine learning (Antje Schlaf)
  - Pros: classification is fast once the model is trained; the more time-consuming model training is only necessary once
  - Cons:
    - needs a representative amount of training data for each category, which we don't have. We could use this method once enough manually labeled tags are in the system
    - re-training necessary / recommended because new topics / labels are added to the growing system
- Option 5 - use topic modeling (Antje Schlaf)
  - Pros:
    - no manual training data needed
    - calculate topics based on given collection, result: word probabilities per topic and topic probabilities per document
  - Cons:
    - topics have no labels (but they can be manually labeled based on the resulting topic words, or we just present a word cloud per topic to the user, which might be ok as well)
    - not sure how to deal with a growing collection. Recalculation from time to time is recommended, but recalculation can lead to different topics also for older data
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
- Also important: language information is probably required for all options, so automatic language recognition is needed in order not to rely on users setting the correct language per deck (Antje Schlaf)
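As an illustration of Option 3, the following sketch scores the terms of a deck with tf-idf using only the Python standard library. The tokenizer, the smoothed idf variant log((1 + N) / (1 + df)), and the example decks are assumptions chosen to keep the example small; it is not a proposal for the final service, which would also need language handling and possibly a POS tagger as noted above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Very simple tokenizer: lowercase alphanumeric runs (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(token_lists: list) -> Counter:
    """Count in how many documents each term occurs at least once."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return df


def top_terms(tokens: list, df: Counter, n_docs: int, k: int = 3) -> list:
    """Return the k terms with the highest tf-idf score for one document."""
    tf = Counter(tokens)
    scores = {
        # Relative term frequency times a smoothed inverse document frequency;
        # a term that occurs in every document gets a score of 0.
        term: (count / len(tokens)) * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Made-up example decks.
decks = {
    "rdf-intro": "RDF triples and RDF graphs: querying RDF data with SPARQL",
    "cooking": "cooking pasta and cooking rice with water",
    "math": "fibonacci numbers and the golden ratio with examples",
}

tokens_by_deck = {deck_id: tokenize(text) for deck_id, text in decks.items()}
df = document_frequencies(list(tokens_by_deck.values()))

for deck_id, tokens in tokens_by_deck.items():
    print(deck_id, top_terms(tokens, df, n_docs=len(decks)))
```

The `df` counter is exactly the document-frequency table that, as listed under Cons, would have to be stored and updated as the collection grows.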