Semantic representation
Description/requirements for D2.3 and (related) T2.4:
(from work packages in grant agreement - https://drive.google.com/open?id=0B4Qow4ezpDrNcnFScklybVZyNXc ) -
D2.3 SlideWiki annotator module (M15). SlideWiki component for semi-automatic semantic annotation of content using the ontologies and existing vocabularies.
T2.4: In this task, we will develop and align SlideWiki content with appropriate ontologies (e.g. DublinCore, SCORM, FOAF, SIOC, Schema.org) for representing semantics of OpenCourseWare material. We will use an RDB2RDF mapping approach (e.g. SparqlMap) for dynamically mapping the existing relational multi-versioning data structure to our semantic model. Providing suitable user interfaces for annotation of content is another goal of this task. We will customize RDFaCE semantic editor to support manual annotation of content based on RDFa and Microdata markups. For automatic content annotation and interlinking, FOX and LIMES will be employed. We will also take advantage of the generated annotations to create a recommender system to propose related content to authors based on information context and user profile.
Proposal (work in progress - please participate):
Concrete tasks for the near future (and deliverable 2.3):
- Design a proper and user-friendly tagging view - Abi James and Klaas Andries de Graaf
- Implement the frontend part that enables users to tag slides/decks and helps them find existing tags - maybe Klaas Andries de Graaf ?
- Research and implement a program that analyzes slides/decks to find topics that will appear as tags - Antje Schlaf is already working on it
- Save tags as ??? somewhere (MongoDB or separate GraphDB?) - maybe Former user (Deleted)
- Link tags to further knowledge (from the internet) - Mariano Rico was interested, Antje Schlaf will also participate
- Transfer tags from SlideWiki V1 DB to SlideWiki V2 DB - Darya Tarasowa should be best for this job
- ???
Open Questions:
- How will we implement tags? As plain text that is appended to slide revisions and decks (e.g. tags: [mathematics, fibonacci, ...]) or as URIs that are already linked to further knowledge (RDF)?
- for the second option: How do we make sure that the concept is distinct, so we can link it to further knowledge? E.g. someone tags a slide with "bench": did they mean furniture or law? (maybe not the best example in English)
- how do we handle concepts/topics when we can't distinguish between them (so we can't decide whether to link to bench (furniture) or bench (law))?
- for the second option: How do we save these tags? As part of the slide model in the MongoDB or as a separate DB (e.g. a Graph DB - read next section)?
- for the second option: Do we want to show to users that they are working with RDF (e.g. by showing them URIs)?
- regardless of whether we model the tags inside MongoDB or as an RDF graph: it should be possible to represent a tag even if it can't be linked to further knowledge. The link to a further-knowledge entity should be optional. The process has 2 steps: 1) get tags 2) try to link them to further knowledge if possible
- ???
- define tag fields/features: what do we need here? Of course not only 1 but several tags per deck, with each tag having certain features: a string for the tag name itself, an assigned probability and a source (since the tag might come from different sources like manually assigned, tag-algorithmA, tag-algoB, etc.). With this we can say: show tags which are assigned manually, and tags from algoA above threshold X and from algoB above threshold Y. If there is a recognized link from the tag to a DBpedia entity (or other ontology entity), this should also somehow be modeled (depending on whether the tags are stored in MongoDB or as a graph, as suggested / asked by Roy).
- storing tag results vs. showing tags to the user: there should be a difference between storing the tagging results and showing them (e.g. through thresholds). Should the user be allowed to remove non-fitting tags which result from an automatic assignment? If we do not delete them but just stop showing them, we can later learn which tags from which algorithms work best and which often get removed. That's why we should keep them somehow internally. When the platform has later been manually tagged well by the users, we can use this data as evaluation data for our automatic algos as well as training data to train models for a better automatic assignment. But this would require something like a notshow / ignore field to allow the user to remove an automatic tag even if it has a high probability.
- ???
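The tag fields discussed in the open questions above can be sketched as a small data model. This is only an illustration, not an agreed design: the class name `Tag`, the field names (`entity_uri`, `hidden`), the source labels (`manual`, `algoA`, `algoB`) and the threshold values are assumptions chosen for the example. It shows an optional link to a further-knowledge entity, a per-source probability threshold, and a flag that hides a tag without deleting it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tag:
    """One tag on a slide or deck (field names are hypothetical)."""
    name: str                          # the tag text itself
    source: str                        # "manual" or the name of an algorithm
    probability: float = 1.0           # manual tags count as certain
    entity_uri: Optional[str] = None   # optional link to further knowledge
    hidden: bool = False               # user removed it; kept for later evaluation


def visible_tags(tags: List[Tag], thresholds: Dict[str, float]) -> List[Tag]:
    """Return the tags to show: manual tags, plus automatic tags whose
    probability reaches the threshold configured for their source."""
    shown = []
    for tag in tags:
        if tag.hidden:
            continue  # stored, but never shown again
        if tag.source == "manual":
            shown.append(tag)
            continue
        threshold = thresholds.get(tag.source)
        if threshold is not None and tag.probability >= threshold:
            shown.append(tag)
    return shown


tags = [
    Tag("fibonacci", "manual"),
    Tag("mathematics", "algoA", 0.9, "http://dbpedia.org/resource/Mathematics"),
    Tag("bench", "algoA", 0.4),                      # ambiguous, no entity link
    Tag("golden ratio", "algoB", 0.95, hidden=True),  # removed by a user
    Tag("sequence", "algoB", 0.7),
]
thresholds = {"algoA": 0.8, "algoB": 0.6}

print([tag.name for tag in visible_tags(tags, thresholds)])
```

Because hidden tags stay in the stored list, the same records could later serve as evaluation or training data, as suggested above.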
Concrete Tasks for the future:
We wrote in D2.1 that we want to "semantify" the whole SlideWiki by transferring all of the data to a (or more than one) RDF graph. Is this still valid?
If yes, we could save tags already as part of a GraphDB, that covers a simplified set of the MongoDB (Slide ID and Deck ID to tags). → Starting point for future work (future deliverables)
How do we want to include RDFa into the SlideWiki frontend?
Real use cases for the semantic representation via RDF:
- Interlink several SlideWiki instances (in case we implement versioning as part of the data store)
- Interlink knowledge (slide/deck content and tags) to further knowledge on the web and provide users with a "discover view" to find related topics, superior topics/concepts, other decks, sources (e.g. papers), information (like wikipedia articles), ....
- queries which might be easier via a graph: show decks which have the same tags / are also highly connected to tag collection x, show users who use tags similar to mine, what other tags do decks/users have that use the same tags as my deck...
- ???
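To make the "decks which have the same tags" use case above concrete, here is a minimal sketch in plain Python that ranks other decks by the number of tags they share with a given deck. The deck IDs, the tag sets and the function name `decks_sharing_tags` are invented for the example; in a graph store the same question would more likely be asked as a query over deck-to-tag edges.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def decks_sharing_tags(deck_tags: Dict[str, Set[str]], deck_id: str) -> List[Tuple[str, int]]:
    """Rank all other decks by how many tags they share with deck_id."""
    # Invert the mapping: tag -> decks carrying that tag (the "edges" of the graph).
    tag_index = defaultdict(set)
    for deck, tags in deck_tags.items():
        for tag in tags:
            tag_index[tag].add(deck)

    # Walk from the deck to its tags and on to the neighbouring decks.
    shared = defaultdict(int)
    for tag in deck_tags[deck_id]:
        for other in tag_index[tag]:
            if other != deck_id:
                shared[other] += 1

    # Most shared tags first; ties are ordered by deck ID.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


deck_tags = {
    "deck1": {"mathematics", "fibonacci"},
    "deck2": {"mathematics", "fibonacci", "nature"},
    "deck3": {"mathematics"},
    "deck4": {"law"},
}

print(decks_sharing_tags(deck_tags, "deck1"))
```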
Academic use cases for the semantic representation:
- Having a Slide/Deck ontology at http://lov.okfn.org/dataset/lov that can be used by other researchers/developers
- Having another open SPARQL endpoint with lots of data
- Semantic description of service APIs (Topic service description)
- Having (in the end of the project) a knowledge base about different NLP metrics that are valid for slides/decks (that are different from books, papers, twitter, ...) (you may specify this in a better way, Antje Schlaf)
- ???
InfAI:
For a semantic-representation of our current data, my colleagues recommended the tools:
- RML - http://rml.io/
- Karma - https://usc-isi-i2.github.io/karma/
- SparqlMap - https://github.com/tomatophantastico/sparqlmap/
- morph-xr2rml - https://github.com/frmichel/morph-xr2rml
Depending on our use-cases of a RDF representation, it might be sufficient to only map data and queries to our actual MongoDB content. In case we have more sophisticated usage scenarios (e.g. a lot of requests), it is better to have an ETL process to a fully featured triple-store.
Anyway, we came up with the following vocabs to (basically) describe slides, decks, users and all the rest:
- Sioc
- Doap
- DC
- Foaf
As another colleague of mine researches the co-evolution of RDF-based data, he has an extensive overview of changeset-describing vocabs that might fit our revision model of slides/decks.
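As a rough illustration of the ETL option mentioned above, the following sketch turns one MongoDB-style deck document into N-Triples lines using terms from DC, SIOC and FOAF. The document shape (`_id`, `title`, `user`, `slides`), the base URI `http://example.org/slidewiki/` and the choice of `sioc:Container` for decks and `sioc:Item` for slides are assumptions made for the example, not the actual SlideWiki model or an agreed mapping.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
SIOC = "http://rdfs.org/sioc/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/slidewiki/"  # hypothetical base URI


def uri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Serialize a plain string literal, escaping backslash, quote and newline."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def deck_to_ntriples(deck):
    """Map one deck document (assumed shape) to a list of N-Triples lines."""
    deck_uri = uri(f"{BASE}deck/{deck['_id']}")
    user_uri = uri(f"{BASE}user/{deck['user']['_id']}")
    triples = [
        (deck_uri, uri(RDF_TYPE), uri(SIOC + "Container")),
        (deck_uri, uri(DCTERMS + "title"), literal(deck["title"])),
        (deck_uri, uri(DCTERMS + "creator"), user_uri),
        (user_uri, uri(RDF_TYPE), uri(FOAF + "Person")),
        (user_uri, uri(FOAF + "name"), literal(deck["user"]["name"])),
    ]
    for slide_id in deck.get("slides", []):
        slide_uri = uri(f"{BASE}slide/{slide_id}")
        triples.append((deck_uri, uri(DCTERMS + "hasPart"), slide_uri))
        triples.append((slide_uri, uri(RDF_TYPE), uri(SIOC + "Item")))
    return [f"{s} {p} {o} ." for s, p, o in triples]


sample_deck = {
    "_id": 7,
    "title": 'Intro to "RDF"',
    "user": {"_id": 3, "name": "Ada"},
    "slides": [1, 2],
}

for line in deck_to_ntriples(sample_deck):
    print(line)
```

The resulting lines could be bulk-loaded into a triple store; the tools listed above (RML, Karma, SparqlMap, morph-xr2rml) express such mappings declaratively instead of in code.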
Notes hangout 16-12-2016 (Roy, Antje, Klaas):
Multiple use cases for semantic representation/annotation:
- In-page semantic annotation (manual add+view. Use NLP?) based on LOD cloud and custom ontologies (Upload own ontology?!). makes slide content more explicit - Good for learners/teaching (RQ: Klaas)
- NLP/Named entity recognition (LSI?) on slide content to detect topics of slide/slides/decks - link to LOD
- Be an LOD provider - RDF store - store ontologies + instances + provide SPARQL query endpoint
- Semantic search - search in semantic annotations / decks/slides in RDF. cluster slides/decks based on topics
- Translate RDF/semantic-annotations?!
- We already have a system for tagging → help in determining topics / link to LOD cloud. Antje: could be used for training - currently there is a lack of data.
- Someone creates deck(presentation) or slide - get recommendations for similar decks/slide that can be reused - do you want to add/reuse this slide? (Antje Schlaf - possible RQ?) – Klaas: different languages - same/similar topic? Antje: difficult if decks have only pictures. Klaas: needs OCR...
- Share public (CC0) SlideWiki(.org) content among all instances (even private ones)
Do we want to semantically represent all data on SlideWiki (users, comments, etc.) or only public data, or only tags/semantic annotations? More data means more use cases.
Research questions:
- Klaas - RQ: Do in-slide semantic annotations and (possibly) domain ontology of teaching materials/didactic help learners (slide consumers) and teachers/instructors (slide consumers) in better learning (better grade results, better understanding, support for diverse students (languages)) + better learning analytics? Relates to use case 1, 2 (for analytics) + use case 4 + use case 5
- Klaas - RQ: Can we infer new knowledge based on annotations on slides? Relates to use case 1 + use case 3 + use case 4 + use case 5
- Klaas - RQ: can we propose an ontology + instances + relationships + complete knowledge base for a certain deck/lecture series based on the annotated instances in a deck/series of slides? Relates to use case 1, 2, 3
- Klaas (with Darya Tarasowa?) - Can we generate exams based on semantic annotations/representation?
- Roy - RQ: Can we push just the "latest" changes of the semantic representation (depends on what it covers) to other instances of SlideWiki, in order to make content available and searchable? E.g. a versioned semantic representation, that shares diffs, but instances do not need to be synchronized - use case 3, 4 and 8
- Roy - RQ: How do we spread the information among SlideWiki instances, based on decentralized Linked Data principles? - use case 3 and 8
- Abi - RQ: How do we make an effective UI for semantic annotations when content consumers have little knowledge of linked data etc - use case 1, 4, 7, 8
- Abi - RQ: Semantic annotations often proposed to improve annotations. Can we provide evidence of this?
Revisions - if we tag decks, automatic annotations, detect topics, etc.. dynamically → do we create a new revision?!
Notes hangout 21-12-2016 (Roy, Antje, Abi, Mariano, Klaas):
(level 1): semi-automatically annotate at slide level using generic DBpedia ontology (existing vocabulary) + other ontologies. Do before month 15 deliverable D2.3 (demonstrator). Annotate general things, e.g., general topic of slide. At deck-level identify information about accessibility of whole deck → need to identify vocabularies for decks at deck-level. At deck level identify educational topics/content. At deck level we may have multiple ontologies (accessibility, availability, education) At slide level → DBpedia for topics + extra?.
Next level (level 2): In-page semantic annotation + other ontologies + upload/link to/use own ontology?.
TODO Done: check deliverable D2.3 minimal requirements.
Klaas: level 1 (D2.3, month 15, see below) does need semi-automatic annotation - Antje Schlaf and Mariano Rico - instead of working on topic discover/recommendations/automatic annotations in parallel, it should be top-priority as well → for D2.3 we need to give users recommendations for annotations (~semi-automatic), or annotate automatically, e.g., with NLP/LSI and some probability attached to the annotations. Antje Schlaf - Would you recommend any tools/systems you use for recommendations/automatic annotations which can be integrated in SlideWiki? Or is it better to do from scratch following certain algorithms?
Mariano : How will users annotate content? - Extra tab in platform
Mariano : need to agree on ontology - use DBpedia?
Klaas: Need screen/UI design and use cases.
Abi: during plenary: annotate at slide level.
Mariano: First prototype available in 3rd month. First annotate → select from list of things/concepts, e.g., people + add name of presenter, or name of people in slide. Buildings, place, etc.. general labels. Specify literal. With this level we can start with recommendation/semantic search, etc...
Klaas: We do iterative and play it safe → minimal level (level 1): annotate at slide level + generic DBpedia ontology. Next level (level 2): In-page semantic annotation + other ontologies + own ontology.
Mariano: I will send proposal with vocabulary for minimal level.
Mariano: Look at deliverable → what is minimal requirement. Only slides. Recommendations? Minimal: List of topics to users → topic of slides
Roy: autocomplete list.
Mariano: yes/good.
Klaas: is list of topics fixed? Why limit if we connect to DBpedia anyways? At least list should be subset of DBpedia.
Mariano: DBpedia has 400 topics - a bit much.
Klaas: Perhaps we have a clever directory structure/selection mechanism → generic topics first, then more specific.
Abi: are DBpedia topics in multiple languages?
Abi: concerned about complexity for users → already have data sources, etc.. This is of interest to me (Klaas: is RQ?)
Mariano: not sure if topic modelling will work with few slides.
Klaas: nevertheless: Antje can work on this in parallel → she can already prototype/discover requirements for good topic modelling.
Antje: Good → I already did tests on old slidewiki.org which has thousands of slides. Is recommendation system planned?
Mariano: Recommendation system → As far as I know → for suggesting slides related to topic of slides you are working on.
Antje: Recommendation system as in: automatic annotation.
Mariano: Selecting text in slides. We can have both → also automatic annotation. Not sure if this is in proposal.
Klaas: semi-automatic annotation in slides == level 2. (topic modelling is level 1)
Abi: Also learning objects in slides (e.g.? Darya is working on this. Exam mode). When editing slides, e.g., put image in, ask users: is this graph with data in it?
<klaas break - break in notes Intermediate summary: Agreed about 2 levels. In level 1 of slides → annotate general things, e.g., general topic of slide. At deck-level identify information about accessibility of whole deck → need to identify vocabulary for decks at deck-level. At deck level identify educational topics/content. At deck level we may have multiple ontologies (accessibility, availability, education) At slide level → DBpedia. >
Roy: map part of our database to RDF model.
Mariano: yes: to use in our search-facility. We need to think what users can ask → the more things we extract from model, the more they can ask.
Antje: Have graph visualisation at deck level - see connections → e.g., slides about Einstein, relativity, etc.. Can we easily ask DBpedia if entities are connected?
Mariano: yes, also number of steps in between, e.g., 2 people in between.
Antje: would be good for users → show what deck is about.
Mariano: is third or fourth level.
Mariano: I will look into ontologies. Abi, do you have ideas for ontologies as well?
Abi: yes ideas for accessibility ontology (colleague will be working more on this in the New Year). Mirette Elias at Bonn has been doing something in this area linked to user profiles. Darya Tarasowa is interested from questions .
- Should look at common standards already in use in educational publishing, e.g. IMS https://www.imsglobal.org/metadata/index.html ; Dublin Core ; Schema.org http://schema.org/docs/schemas.html which is being extended for accessibility http://www.a11ymetadata.org/ and is mapped to DBpedia
Roy: Store annotations in DBpedia or different DB.
Mariano: store in model → map MongoDB model to RDF.
Klaas: concerned about performance. Also if we provide SPARQL query endpoint later.
Mariano: indeed takes time. Batch processing during night → MongoDB to Virtuoso → annotations are 24 hours old max → not real time update if someone annotates.
Klaas: we have to work out technical details later on. Concerned about backwards compatibility.
Mariano: perhaps we can generate RDF every half hour.
Antje: need timestamp
Antje: assign probability to annotations → user has prob. 100%. Automatic annotation has less probability.
Abi: semantic representation is related to search results → SWIK-883
Mariano: provide DBpedia topics + entries/text as training data → Antje can do tests.
Antje: yes! we can do experiments.
Mariano: we do experiment (in parallel) with DBpedia topics + entries → see if we can do topic suggestions.
TODOs → in 1st sprint (New Year's resolution):
Mariano → provide list of topics for slides (level 1)
Klaas: link to Confluence page on schemas suggested by Ben. → Use of Metadata Standards
Klaas: Make above TODOs into tasks for 1st sprint.
Klaas → Look at Month 3 deliverable → what is minimal requirement for level 1?
Antje → work in parallel on prototyping/experimenting topic modelling / automatic recommendations.
Antje → take look at topics in DBpedia
Level 1 semantic annotation:
D2.3: SlideWiki annotator module. SlideWiki component for semi-automatic semantic annotation of content using the ontologies and existing vocabularies.
Related / larger task: T2.4. Semantic annotation, enrichment and recommendation: (Start M1, End M27 ; Lead: UFRJ; Participants: VUA, InfAI, Pupin, UPM, Fraunhofer, ATHENA, SOTON). Enriching educational content with semantic representations helps to create more efficient and effective search interfaces, such as faceted search or question answering. It will also provide customized and context-specific content which better fits user needs. In this task, we will develop and align SlideWiki content with appropriate ontologies (e.g. DublinCore, SCORM, FOAF, SIOC, Schema.org) for representing semantics of OpenCourseWare material. We will use an RDB2RDF mapping approach (e.g. SparqlMap) for dynamically mapping the existing relational multi-versioning data structure to our semantic model. Providing suitable user interfaces for annotation of content is another goal of this task. We will customize RDFaCE semantic editor to support manual annotation of content based on RDFa and Microdata markups. For automatic content annotation and interlinking, FOX and LIMES will be employed. We will also take advantage of the generated annotations to create a recommender system to propose related content to authors based on information context and user profile. Building content on top of the existing content will save a considerable amount of time for users and will increase the consistency and integrity of the content.