
Goal: decide on implementation

  1. How will we implement tags?
  2. How do we save these tags?
    • Option 1 - As part of the slide model in MongoDB
      • Pros: easy to add to our current model, same technology stack
      • Cons: possibly complex (and long-running) queries to find things related to tags; more complex to add information from the WWW. For reasoning, inferring knowledge, and semantic search we will need an RDF store with a SPARQL endpoint working in real time for users, right? (T2.4) (Aleksandr Korovin; Klaas Andries de Graaf agrees)
    • Option 1.1 - JSON-LD (Allan Third) (option 1.1 added by Klaas Andries de Graaf for discussion overview purposes - check if I misunderstood the argument/pros/cons)
      • Pros: there's no separation between the Mongo representation and the RDF; can help with SEO if embedded in pages (Ali Khalili). Graph DBs support JSON-LD (Aleksandr Korovin)
      • Cons: shouldn't be such a disruptive change; it should just involve adding some fields to the JSON that's there (Allan Third). Add an @context field with context.json and that is all (Aleksandr Korovin). May need an additional mapper to a triple DB for fast SPARQL queries (see additions/complications below)
      • Additions/complications: needs a SPARQL mapper for querying (Aleksandr Korovin). The on-the-fly conversion of a SPARQL request to a MongoDB request can be too slow for complex SPARQL queries; I vote for off-line RDF generation by means of mappers (Mariano Rico). For performance reasons I am also more apt to use a triple store rather than on-the-fly query rewriting; go for both approaches, do a benchmark, and then decide (Ali Khalili).
    • Option 2 - as a separate DB (e.g. a Graph DB - read next section)
      • Pros: simple queries, leverages standard Semantic Web technologies (like interlinking/inferring knowledge on the WWW), no schema boundaries
      • Cons: possibly not so good performance, another technology stack, needs synchronising. How do we search both the graph DB and MongoDB?
    • Other options? (Ali Khalili , Roy Meissner , Mariano Rico , Antje Schlaf , Luis Daniel Fernandes Rotger )
    • Related:
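As a rough sketch of what Option 1.1 might look like, the snippet below adds a JSON-LD @context to a hypothetical slide document. The field names and the schema.org vocabulary are illustrative assumptions, not the actual SlideWiki schema:

```python
import json

# Hypothetical slide document as it might be stored in MongoDB
# (field names are illustrative, not the real SlideWiki model).
slide = {
    "_id": "slide-42",
    "title": "Introduction to RDF",
    "tags": ["rdf", "semantic-web"],
}

# JSON-LD context mapping the existing fields to vocabulary terms
# (schema.org is used here only as an example vocabulary).
context = {
    "title": "http://schema.org/name",
    "tags": "http://schema.org/keywords",
}

# The stored document stays as-is; only "@context" and "@id" are added.
slide_ld = dict(slide, **{"@context": context, "@id": slide["_id"]})
print(json.dumps(slide_ld, indent=2))
```

This illustrates Allan Third's point that the change is additive: the existing JSON fields remain untouched, and an RDF consumer can interpret them via the context.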
  3. How do we realize semi-automatic semantic annotation of decks and slides?
    • Option 1 - Use the LD-R semi-automatic annotation feature (API?) (Ali Khalili)
      • Pros: 
      • Cons: 
    • Option 2 - Implement our own function/service?
      • Pros: 
      • Cons: possibly takes more time? Could reuse existing libraries?
    • Option 3 - use tf-idf as a general measure to get important terms for deck/slides (Antje Schlaf)
      • Pros: 
        • gets important terms independently of recognizing DBpedia (or other) entities; should be used in combination (get important terms, link to known entities when possible)
        • terms are considered special/important when they are frequent in the current deck but not commonly frequent in other documents (e.g. "RDF" was used often in this deck but is not a term often used in decks in general)
      • Cons: 
        • possibly a bit general, but for the bigger languages we can use POS taggers etc. to restrict to nouns if wanted
        • document frequency of terms needs to be recalculated/updated and stored so it is accessible for the calculation
    • Option 4 - use machine learning (Antje Schlaf)
      • Pros: classification is fast once the model is trained; the more time-consuming model training is only necessary once
      • Cons: 
        • needs a representative amount of training data for each category, which we don't have. We could use this method once enough manually labelled tags are in the system
        • re-training is necessary/recommended as new topics/labels are added to the growing system
    • Option 5 - use topic modeling (Antje Schlaf)
      • Pros:  
        • no manual training data needed
        • calculates topics based on a given collection; result: word probabilities per topic and topic probabilities per document
      • Cons:
        • topics have no labels (but they can be manually labelled based on the resulting topic words, or a word cloud per topic can be presented to the user, which might be OK as well)
        • not sure how to deal with a growing collection; recalculation from time to time is recommended, but recalculation can lead to different topics for older data as well
    • Other options?(Ali Khalili , Roy Meissner , Mariano Rico , Antje Schlaf , Luis Daniel Fernandes Rotger )
    • also important: language info is probably required for all options; thus automatic language recognition is needed so we do not rely on users setting the correct language per deck (Antje Schlaf)
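Option 3 above could be sketched roughly as follows. This is a minimal, self-contained tf-idf over already-tokenised decks (no external libraries; tokenisation and the deck corpus are assumed to exist elsewhere), illustrating how "RDF" would score high in a deck where it is frequent but rare across the collection:

```python
import math
from collections import Counter

def tfidf(decks):
    """Compute tf-idf scores for the terms of each deck.

    decks: dict mapping deck id -> list of tokens.
    Returns: dict mapping deck id -> {term: score}.
    """
    n = len(decks)
    # Document frequency: in how many decks does each term occur?
    # This is the count that would need to be kept up to date as
    # the collection grows (see the cons above).
    df = Counter()
    for tokens in decks.values():
        df.update(set(tokens))
    scores = {}
    for deck_id, tokens in decks.items():
        tf = Counter(tokens)
        total = len(tokens)
        scores[deck_id] = {
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return scores
```

For example, with a deck tokenised as `["rdf", "rdf", "sparql", "slide"]` in a collection where "slide" occurs in every deck, "slide" gets idf 0 and drops out, while "rdf" ranks first, which matches the intuition described in the pros.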
