Goal: decide on implementation
- How will we implement tags?
- Option 1: as plain text that is appended to slide revisions and decks (e.g. tags: [mathematics, fibonacci, ...]); both options are sketched below this list
- Pros: easy to implement
- Cons: tags remain ambiguous strings; no link to further knowledge, so nothing can be inferred from them
- Option 2: as URIs that are already linked to further knowledge (RDF)
- Pros: we might be able to infer knowledge from the WWW, and we might be able to present enriched recommendations to the user
- Cons: we might not be able to link to distinct topics/concepts if we are unsure about the meaning of a tag
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
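A minimal sketch of the difference between the two options, assuming JSON-like slide documents (field names and URIs here are illustrative, not our actual schema):

```python
# Illustrative only: field names and URIs are assumptions, not our schema.

# Option 1: tags as plain strings appended to a slide revision.
slide_plain = {
    "_id": "slide-42-rev-3",
    "tags": ["mathematics", "fibonacci"],  # ambiguous strings
}

# Option 2: tags as URIs that point to linked knowledge (e.g. DBpedia).
slide_uris = {
    "_id": "slide-42-rev-3",
    "tags": [
        "http://dbpedia.org/resource/Mathematics",
        "http://dbpedia.org/resource/Fibonacci_number",
    ],  # unambiguous concepts we can follow to infer further knowledge
}
```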
- How do we save these tags?
- Option 1 - As part of the slide model in MongoDB (both storage options are sketched after the Related links)
- Pros: easy to add to our current model, same technology stack
- Cons: possibly complex (and long-running) queries to find things related to tags; more complex to add information from the WWW
- Option 2 - as a separate DB (e.g. a Graph DB - read next section)
- Pros: simple queries, leverage standard Semantic Web technologies (like interlinking/inferring knowledge on the WWW), no schema boundaries
- Cons: possibly weaker performance; another technology stack to maintain
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
- Related:
- RML - http://rml.io/
- Karma - https://usc-isi-i2.github.io/karma/
- SparqlMap - https://github.com/tomatophantastico/sparqlmap/
- morph-xr2rml - https://github.com/frmichel/morph-xr2rml
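A rough sketch of how querying tags would differ between the two storage options, assuming pymongo for Option 1 and a SPARQL endpoint queried via SPARQLWrapper for Option 2; the database/collection names, endpoint URL, and tag predicate are all assumptions:

```python
from pymongo import MongoClient
from SPARQLWrapper import SPARQLWrapper, JSON

# Option 1: tags live inside the slide model in MongoDB.
# Database, collection, and field names are assumptions, not our actual schema.
client = MongoClient("mongodb://localhost:27017")
slides = client["slidewiki"]["slides"]
for slide in slides.find({"tags": "fibonacci"}):  # array membership match
    print(slide["_id"])

# Option 2: tags live in a separate graph DB queried via SPARQL.
# Endpoint URL and the tag predicate are placeholders.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery("""
    SELECT ?slide WHERE {
        ?slide <http://example.org/hasTag>
               <http://dbpedia.org/resource/Fibonacci_number> .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["slide"]["value"])
```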
- How do we realize semi-automatic semantic annotation of decks and slides?
- Option 1 - Use the LD-R semi-automatic annotation feature (API?) (Ali Khalili)
- Pros: reuse of an existing feature; less implementation effort
- Cons:
- Option 2 - Implement our own function/service?
- Pros:
- Cons: possibly takes more time, although we could reuse existing libraries
- Option 3 - use tf-idf as a general measure to get important terms for decks/slides (Antje Schlaf); see the tf-idf sketch after this list
- Pros:
- gets important terms independently of recognizing DBpedia (or other) entities; should be used in combination (get important terms, link them to known entities when possible)
- terms count as special/important when they are frequent in the current deck but not commonly frequent in other documents (e.g. "RDF" is used often in this deck but is not a term often used in decks in general)
- Cons:
- possibly a bit too general, but for the major languages we can use POS taggers etc. to restrict results to nouns if wanted
- document frequencies of terms need to be recalculated/updated as the collection grows and stored so they are accessible for the calculation
- Option 4 - use machine learning (Antje Schlaf); see the classifier sketch after this list
- Pros: classification is fast once the model is trained; the more time-consuming model training is only necessary once
- Cons:
- needs a representative amount of training data for each category, which we don't have; we could use this method once enough manually labeled tags are in the system
- re-training is necessary/recommended as new topics/labels are added to the growing system
- Option 5 - use topic modeling (Antje Schlaf); see the LDA sketch after this list
- Pros:
- no manual training data needed
- calculates topics from the given collection; result: word probabilities per topic and topic probabilities per document
- Cons:
- topics have no labels (but they can be labeled manually based on the resulting topic words, or we just present a word cloud per topic to the user, which might be ok as well)
- unclear how to deal with a growing collection; recalculation from time to time is recommended, but recalculation can lead to different topics even for older data
- Other options? (Ali Khalili, Roy Meissner, Mariano Rico, Antje Schlaf, Luis Daniel Fernandes Rotger)
- also important: language info is probably required for all options, so automatic language recognition is needed so that we do not rely on users setting the correct language per deck (Antje Schlaf); see the language-detection sketch below
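A minimal tf-idf sketch for Option 3, assuming scikit-learn and plain-text deck contents (the example decks are made up); it surfaces terms that are frequent in one deck but rare across the others:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up deck texts; in practice these would come from our slide contents.
decks = [
    "RDF triples RDF graphs SPARQL queries over RDF data",
    "Fibonacci numbers and the golden ratio in mathematics",
    "introduction to presentations slides and decks",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(decks)
terms = vectorizer.get_feature_names_out()

# Top terms of the first deck: frequent there, rare elsewhere ("rdf", "sparql").
row = tfidf[0].toarray().ravel()
top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
print(top)
```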
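A sketch of Option 4, assuming scikit-learn and that enough manually labeled decks already existed (the tiny training set here is invented); the slow step is the one-off training, after which classification is fast:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented labeled examples; a real model needs a representative
# amount of training data per category.
texts = [
    "RDF triples and SPARQL endpoints",
    "linked data vocabularies and ontologies",
    "Fibonacci sequence and prime numbers",
    "geometry proofs and algebra exercises",
]
labels = ["semantic-web", "semantic-web", "mathematics", "mathematics"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)  # the (comparatively slow) one-off training step

# Classifying new decks is then fast.
print(model.predict(["a deck about SPARQL query optimization"]))
```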
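A sketch of Option 5, again assuming scikit-learn (gensim would work as well); LDA yields unlabeled topics as word distributions, which we could present as word clouds per topic:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up deck texts standing in for our collection.
decks = [
    "RDF graphs SPARQL linked data",
    "SPARQL endpoints RDF vocabularies",
    "Fibonacci numbers golden ratio",
    "prime numbers number theory Fibonacci",
]

counts = CountVectorizer().fit(decks)
doc_term = counts.transform(decks)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

# Word probabilities per topic: show the top words of each (unlabeled) topic.
terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:4]]
    print(f"topic {i}: {top}")

# Topic probabilities per document:
print(lda.transform(doc_term))
```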
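For the language-recognition prerequisite, a sketch assuming the langdetect package (other detectors such as langid would do too):

```python
from langdetect import detect

# Detect the deck language instead of relying on the user-set language field.
print(detect("Dies ist ein Foliensatz über RDF und Linked Data"))  # typically "de"
print(detect("This deck is about RDF and linked data"))            # typically "en"
```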