Design options and decisions: Technical Implementation of Semantic representation

Description/requirements for D2.3:

(from the work packages in the grant agreement: https://drive.google.com/open?id=0B4Qow4ezpDrNcnFScklybVZyNXc)

In this task, we will develop and align SlideWiki content with appropriate ontologies (e.g. DublinCore, SCORM, FOAF, SIOC, Schema.org) for representing semantics of OpenCourseWare material. We will use an RDB2RDF mapping approach (e.g. SparqlMap) for dynamically mapping the existing relational multi-versioning data structure to our semantic model. Providing suitable user interfaces for annotation of content is another goal of this task. We will customize RDFaCE semantic editor to support manual annotation of content based on RDFa and Microdata markups. For automatic content annotation and interlinking, FOX and LIMES will be employed. We will also take advantage of the generated annotations to create a recommender system to propose related content to authors based on information context and user profile.

Goal: decide on implementation

How will we implement tags?
- Option 1: as plain text that is appended to slide revisions and decks (e.g. tags: [mathematics, fibonacci, ...]), or
  - Pros: easy to implement
  - Cons: enriching the tags with data from the WWW might become difficult, as we have no place to store additional information (Roy Meissner)
- Option 2: as URIs that are already linked to further knowledge (RDF)?
  - Pros: we might be able to infer knowledge from the WWW, and we might be able to present enriched recommendations to users
  - Cons: we might not be able to link to distinct topics/concepts when we are unsure about the meaning of a tag.
- Option 3: as both URIs and tags (Mariano Rico), storing each tag as plain text with an optional URL, like this: '{ name: Scientific, url: http://dbpedia.org/ontology/Scientific }' (Aleksandr Korovin). (Option 3 added by Klaas Andries de Graaf for discussion overview purposes; please check whether I misunderstood the arguments/pros/cons.)
  - Pros: a language-specific label can be fetched from DBpedia via the URI (Mariano Rico, see SWIK-903). Users can still specify a tag without a URL (Aleksandr Korovin); the link is optional and links are not requested from users (Roy Meissner). Users can add plain-text tags that we might enrich later, with URLs added afterwards by our automatic system (e.g. a Named Entity Recognition that is able to disambiguate) (Roy Meissner).
  - Cons: not good for keyword searching in MongoDB, where the typical data model is an array (https://docs.mongodb.com/manual/tutorial/model-data-for-keyword-search/). We need to test a query that searches the tags array by name: tags: [{name: 'name1', ...}, {name: 'name2'}, ...] (Aleksandr Korovin) → also in SOLR? (Serafeim Chatzopoulos)
  - Possible complication: disambiguation problem when linking tag → URI (Roy Meissner). We need a custom mapping for tags with some meta information, e.g. slide202 → {tag1 → {sameAs: <url>, algorithm: NamedEntityRecognition, manuallyAdded: false, ...}, tag2 → ...} (Roy Meissner)
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
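
To make Option 3 and the "possible complication" above more concrete, here is a minimal sketch of a tag that carries a plain-text name, an optional URL, and some provenance metadata. The class name `Tag`, the helper functions, and the snake_case field names are illustrative assumptions, not an agreed schema; only `name`, `url`, `algorithm` and `manuallyAdded` are taken from the discussion above. The lookup runs in memory; in MongoDB the equivalent query would presumably use dot notation on the embedded field (`tags.name`), which still needs to be tested as noted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tag:
    """A tag as in Option 3: a plain-text name plus an optional URL."""
    name: str
    url: Optional[str] = None          # e.g. a DBpedia URI; None until the tag is linked
    algorithm: Optional[str] = None    # assumed meaning: which annotator produced the link
    manually_added: bool = True        # assumed meaning: False once an automatic system set the link


def find_by_tag_name(documents, name):
    """Return ids of documents that carry a tag with the given name (case-insensitive)."""
    wanted = name.casefold()
    return [
        doc["id"]
        for doc in documents
        if any(tag.name.casefold() == wanted for tag in doc["tags"])
    ]


def link_tag(tag, url, algorithm):
    """Attach a URL to a plain-text tag later, e.g. after entity disambiguation."""
    tag.url = url
    tag.algorithm = algorithm
    tag.manually_added = False
    return tag


# Hypothetical sample data: users entered plain-text tags only.
slides = [
    {"id": "slide202", "tags": [Tag("Scientific"), Tag("fibonacci")]},
    {"id": "slide203", "tags": [Tag("mathematics")]},
]

# An automatic step enriches one tag afterwards; the other tags stay plain text.
link_tag(slides[0]["tags"][0], "http://dbpedia.org/ontology/Scientific", "NamedEntityRecognition")
matches = find_by_tag_name(slides, "scientific")
```

The point of the sketch is that the URL stays optional: searching by name works whether or not a tag has been linked, and linking can happen at any later time without changing what users typed.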
How do we save these tags?
1. DECISION 1: save tags as part of the current MongoDB, directly on slides/decks (either as JSON-LD or plain JSON). Passed with 5 votes (Roy, Klaas, Alex, Luis, Paul), i.e. > 50% of those involved.
  DECISION 2: Option 2) as plain JSON. Passed with 5 votes (Roy, Antje, Klaas, Luis, Alex), i.e. > 50% (5 of 9 possible votes).
- Option 1 - as part of the slide model in MongoDB, or
  - Pros: easy to add to our current model (Roy Meissner), same technology stack. We can use Solr to have a (however materialized) index for tags, so we are able to search among them efficiently (Roy Meissner)
  - Cons: possibly complex (and long-running) queries to find things that are related to tags; more complex to add information from the WWW. For reasoning, inferring knowledge and semantic search we will need an RDF store with a SPARQL endpoint that works in real time for users (T2.4) (Aleksandr Korovin; Klaas Andries de Graaf agrees). Needs a mapping (does this also hold for option 1.1? See the point "Additions/complications")
- Option 1.1 - JSON-LD (Allan Third) (Option 1.1 added by Klaas Andries de Graaf for discussion overview purposes; please check whether I misunderstood the arguments/pros/cons)
  - Pros: there is no separation between the Mongo representation and the RDF. Can help with SEO if embedded in pages (Ali Khalili). GraphDB supports JSON-LD (Aleksandr Korovin)
  - Cons: shouldn't be such a disruptive change, as it should just involve adding some fields to the JSON that's there (Allan Third): add an @context field with context.json and that is all (Aleksandr Korovin). May need an additional mapper to a triple DB for fast SPARQL queries (see Additions/complications below).
    The discussion was originally about how we store the tags in MongoDB, and whether we need JSON-LD to do so. As Roy said, after a longer discussion here we came to the conclusion that, for the storage decision alone, normal JSON is sufficient for now. Storing as JSON-LD requires design decisions for a possible later RDF graph. We concluded that we can store the tags as normal JSON, and as soon as someone wants to create an RDF graph, only a good ETL-to-RDF process is needed. So there is no need to discuss details of JSON-LD while we don't yet know what this potential RDF graph should look like. (Antje Schlaf)
  - Additions/complications: needs a SPARQL mapper for querying (Aleksandr Korovin, Roy Meissner). The on-the-fly conversion of a SPARQL request to a MongoDB request can be too slow for complex SPARQL queries; I vote for offline RDF generation by means of mappers (Mariano Rico). For performance reasons I am also more inclined to use a triple store rather than on-the-fly query rewriting; go for both approaches, do a benchmark and then decide (Ali Khalili). Transfer (ETL) the database content (or just the changes since last time) to an actual graph store from time to time (Roy Meissner).
- Option 2 - as a separate DB (e.g. a Graph DB - read next section)
  - Pros: simple queries, leverage default Semantic Web technologies (like interlink/infer knowledge on the WWW), no schema boundaries
  - Cons: possibly not so good performance, another technology stack. Needs synchronising (Roy Meissner). How to search both GraphDB and MongoDB?
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
- Related:
  - RML - http://rml.io/
  - Karma - https://usc-isi-i2.github.io/karma/
  - SparqlMap - https://github.com/tomatophantastico/sparqlmap/
  - morph-xr2rml - https://github.com/frmichel/morph-xr2rml
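
The "store as plain JSON now, ETL to RDF later" route decided above can be sketched as follows. This is only an illustration under assumptions: the base URI `http://example.org/slidewiki/`, the record layout, and the choice of `dcterms:subject` for linked tags and `schema:keywords` for plain-text tags are placeholders, since the target RDF graph has not been designed yet. The function emits N-Triples lines that could then be loaded into a graph store.

```python
import json

# Hypothetical base URI and predicate choices; the actual vocabulary is undecided.
BASE = "http://example.org/slidewiki/"
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"
SCHEMA_KEYWORDS = "http://schema.org/keywords"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def slide_to_ntriples(slide):
    """Turn one plain-JSON slide record into a list of N-Triples lines."""
    subject = f"<{BASE}slide/{slide['id']}>"
    lines = []
    for tag in slide.get("tags", []):
        if tag.get("url"):
            # Linked tag: point at the external resource.
            lines.append(f"{subject} <{DCTERMS_SUBJECT}> <{tag['url']}> .")
        else:
            # Plain-text tag: keep it as a literal keyword.
            literal = escape_literal(tag["name"])
            lines.append(f'{subject} <{SCHEMA_KEYWORDS}> "{literal}" .')
    return lines


# A record as it might be stored in MongoDB (assumed layout).
record = json.loads(
    '{"id": "202", "tags": ['
    '{"name": "Scientific", "url": "http://dbpedia.org/ontology/Scientific"},'
    '{"name": "fibonacci"}]}'
)
triples = slide_to_ntriples(record)
```

Running such a job periodically, or only over records changed since the last run, corresponds to the offline ETL variant discussed above; on-the-fly SPARQL rewriting would need one of the mapping tools listed under "Related" instead.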
How do we realize semi-automatic semantic annotation of decks and slides?

- Option 0 - (basic) annotator tab with a list of annotations per deck and slide, similar to the sources tab (Klaas Andries de Graaf)
  - Pros: allows users to easily add, administer and change the annotations manually. Can be extended with automatic features later on.
  - Cons: is not automatic and needs to be extended. Users need to manually add a tag name and a URI for linking to e.g. a DBpedia concept.
- Option 1 - Use the LD-R semi-automatic annotating feature (API?) (Ali Khalili)
  - Pros:
  - Cons:
- Option 2 - Implement our own function/service?
  - Pros:
  - Cons: Possibly takes more time? Reuse existing libraries?
- Option 3 - use tf-idf as a general measure to get important terms for deck/slides (Antje Schlaf)
  - Pros:
    - get important terms independently of recognizing DBpedia (or other) entities; should be used in combination (get important terms, link to known entities when possible)
    - terms count as special / important when they are frequent in the current deck but not commonly frequent in other documents (e.g. "RDF" was often used in this deck but is not a term often used in decks in general)
  - Cons:
    - possibly a bit general, but for the bigger languages we can use POS-Taggers etc. to restrict to nouns if wanted
    - the document frequency of terms needs to be recalculated / updated, and stored so that it is accessible for the calculation
- Option 4 - use machine learning (Antje Schlaf)
  - Pros: classification is fast once the model is trained; the more time-consuming model training is necessary only once
  - Cons:
    - needs a representative amount of training data for each category, which we don't have. We could use this method once enough manually labeled tags are in the system
    - re-training is necessary / recommended because new topics / labels are added as the system grows
- Option 5 - use topic modeling (Antje Schlaf)
  - Pros:
    - no manual training data needed
    - calculate topics based on given collection, result: word probabilities per topic and topic probabilities per document
  - Cons:
    - topics have no labels (but they can be labeled manually based on the resulting topic words, or we just present a word cloud per topic to the user, which might be OK as well)
    - not sure how to deal with a growing collection. Recalculation from time to time is recommended, but recalculation can lead to different topics also for older data
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
- also important: language info probably required for all options, thus: automatic language recognition required to not rely on users setting correct language per deck (Antje Schlaf)
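
As an illustration of Option 3, the following is a minimal tf-idf sketch in plain Python. The deck texts, the function names and the deliberately naive ASCII tokenizer are assumptions made for the example; a real implementation would need language-aware tokenization (see the language note above) and a stored, regularly updated document-frequency table.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(decks, deck_id, k=3):
    """Rank the terms of one deck by tf-idf against all decks."""
    tokens = {d: tokenize(text) for d, text in decks.items()}
    # Document frequency: in how many decks does each term occur?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    n = len(decks)
    tf = Counter(tokens[deck_id])
    total = len(tokens[deck_id])
    # Relative term frequency times log inverse document frequency.
    scores = {
        term: (count / total) * math.log(n / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical deck contents.
decks = {
    "deck1": "RDF triples and RDF graphs: the RDF data model",
    "deck2": "the data model of relational databases",
    "deck3": "graphs and trees in the algorithms course",
}
suggested = top_terms(decks, "deck1", k=2)
```

In this toy collection "rdf" ranks first for deck1 because it is frequent there and absent elsewhere, while "the" scores zero because it occurs in every deck. The suggested terms could then be offered as plain-text tags and linked to known entities where possible.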
