Move stable data to slidewiki.org

Background

Going forward, we would like everyone to use a single SlideWiki deployment, namely slidewiki.org. The deployment on stable.slidewiki.org has been used for the trials, and includes decks, users, groups, as well as activity data related to the trials. We are going to actively promote the new site, and it would be highly desirable to have as much data as possible moved to slidewiki.org.

Proposed Tools and Methods

At a high level, the task of moving data can be split into parts, listed with the most important first (the ordering is open to debate):

  1. Move just the decks and slides content and most metadata → user information is lost, all content is owned by some other user
  2. Copy the tags already in decks and slides → not as important as the rest, but easy to attain
  3. Copy the creator and contributors information of decks and slides → this also includes copying users to slidewiki.org; all points after this one are very easy to perform
  4. Copy the user activities related to deck/slide content and metadata
  5. Copy the deck history (includes deck/slide revisions, the history tab content)
  6. Copy the user edit rights for decks, including user groups memberships
  7. Copy user comments in content
  8. Copy user ownership information for uploaded files

Copy files and update URLs tool

Regardless of the above, the data copying process also needs to address the user-uploaded files and the URLs that point to them in the slide content being copied. In D1.7, "SlideWiki mirroring and update propagation", there is a reference to a script/tool named "update_urls" that would fix any URLs pointing to the source deployment host (in this instance, stable.slidewiki.org) by updating them to point to the respective files on the target host (slidewiki.org).
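A minimal sketch of what such a URL-rewriting step might look like is given below. It is not the actual update_urls tool: the function name rewrite_host is hypothetical, and it assumes that slide content is available as plain HTML strings and that files may be served from a subdomain of the deployment host.

```python
import re

SOURCE_HOST = "stable.slidewiki.org"
TARGET_HOST = "slidewiki.org"


def rewrite_host(content: str, source: str = SOURCE_HOST, target: str = TARGET_HOST) -> str:
    """Return content with URLs on the source host rewritten to the target host.

    Hypothetical helper: any subdomain labels in front of the source host
    (assumed here, e.g. a file service subdomain) are kept as they are.
    """
    pattern = re.compile(
        r"(https?://(?:[\w-]+\.)*)"   # scheme plus optional subdomain labels
        + re.escape(source)
        + r"(?!\.?[\w-])"             # the host name must end here
    )
    return pattern.sub(lambda m: m.group(1) + target, content)


sample = (
    '<img src="https://fileservice.stable.slidewiki.org/picture/abc.png">'
    '<a href="http://stable.slidewiki.org/deck/12">old deck</a>'
)
rewritten = rewrite_host(sample)
```

URLs that already point to slidewiki.org are left untouched, so the step could be re-run safely on content that was partially processed.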

Simply copy decks and slides: slidewiki-cli

With slidewiki-cli tools we already have a way to copy over deck structure and slide contents between any two deck service deployments (point 1 in list). This does not include any revisions, forking or history metadata, creator or contributor information, tags, etc. If we only use this, then:

  1. The owners of decks on stable that were copied would not have edit rights → we would need to grant them manually upon request
  2. The copied decks would also not be listed in the "My Decks" page of their original authors (pending ownership transfer functionality, or "My Decks" page enhancements)
  3. The decks/slides on slidewiki.org would have new ids with no simply defined relationship to the old ids
  4. Any forks would not be tracked as such

After deploying some tag-service updates, it would be possible to also attain point 2 in the list (tags) using slidewiki-cli tools, but not before.

Under this proposal, we would have to obtain a list of decks to copy, and perform the copying using some user's credentials. Alternatively, we could ask for a list of users that want all their decks copied. In either case, some kind of helper script would also need to be created.

Copy everything: slidewiki-data-tools

With the slidewiki-data-utils tools, currently under development, it is possible to copy all SlideWiki data from stable to slidewiki.org in a way that handles ids predictably and keeps all metadata.

The tool works directly with the database, so using it does not require deploying new versions of any microservices. The main functionality we would use in this task is the shiftids command, which:

  • Updates all sequential ids in the respective collections in a database so that an offset is added to them
  • Updates all fields in any collection in the database that reference any other sequential id, by applying the same offset

With a properly chosen offset, merging those collections with the target database will not result in any conflicts.
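The idea behind shifting ids can be illustrated with a small sketch that works on in-memory documents rather than on a real database. It is not the shiftids implementation; the helper names and the deck_id reference field are hypothetical, and the data is made up for illustration.

```python
from typing import Iterable, List, Set


def shift_fields(docs: List[dict], offset: int, fields: Iterable[str]) -> None:
    """Add offset to every listed integer field of every document, in place."""
    fields = tuple(fields)
    for doc in docs:
        for field in fields:
            if isinstance(doc.get(field), int):
                doc[field] += offset


def colliding_ids(source_docs: List[dict], target_docs: List[dict]) -> Set[int]:
    """Return the ids present in both collections (empty means safe to merge)."""
    return {d["_id"] for d in source_docs} & {d["_id"] for d in target_docs}


# Illustrative data only: a deck collection, plus a collection referencing deck ids.
source_decks = [{"_id": 1}, {"_id": 2}]
source_changes = [{"deck_id": 2, "action": "rename"}]
target_decks = [{"_id": 1}, {"_id": 2}, {"_id": 3}]

OFFSET = 10  # must exceed the largest id in the target collection

before = colliding_ids(source_decks, target_decks)  # ids 1 and 2 clash
shift_fields(source_decks, OFFSET, ["_id"])          # shift the ids themselves
shift_fields(source_changes, OFFSET, ["deck_id"])    # shift references to them
after = colliding_ids(source_decks, target_decks)    # nothing clashes any more
```

Because the same offset is applied to the ids and to every field referencing them, the relationships between documents are preserved while the id ranges of source and target no longer overlap.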

Furthermore:

  1. It will be easy to locate a deck from stable that was copied over to slidewiki.org: simply apply the offset chosen
  2. All user attributions will be maintained, which means users would be able to locate those decks in their "My Decks" page
  3. Users would also retain any edit rights or group memberships, so no edit rights would need to be granted manually by a single user
  4. Forks of decks would be also kept intact
  5. All other data, like history, revisions, comments, and activities can be easily included in the database merging

More specifically, the database collections that would need to have their IDs shifted are as follows (snapshot of 22/1/2018):

collection    max_id in stable    max_id in slidewiki.org    suitable offset    unique fields
decks         11589               105836                     200000
slides        51539               696940                     800000
usergroups    95                  4006                       5000
users         2895                15627                      20000              email, username
tags          161                 896                        1000               tagName

Updating the ids of the above collections would also result in updating some fields in the following additional collections (the tool already handles that for all of them). These collections do not have sequential ids; they use the ObjectId provided by MongoDB instead. We can therefore merge them with the collections in slidewiki.org without issue.

  • deckchanges
  • discussions
  • activities
  • media

The notifications collection could also be updated, but since that data is transient by nature, we don't have to bother with it.

As noted in the table above, the users and tags collections cannot be updated with a simple ID shifting, as they also include some other unique fields that would need to be checked to ensure there are no duplicates after merging.

For tags, this is not a problem, because their ids are not referenced in any other collection. We will simply skip copying the tags collection, and copy only the tagName data as it is included in the decks collection. The platform and deck service can already handle tags that are not part of the tags collection, and automatically populate the tags collection when needed during normal operation (the /tags/upload REST API in the tag service).

For users, this is quite different. We need to match users between the two databases by using the email field, as it's unique and the only reliable method to tell that two users are the same person. We also need to check for duplicate usernames when the email is not the same, and decide on how to handle such cases.
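The matching rules described above could be sketched roughly as follows. This is an illustration on in-memory records, not part of the actual tools: the function name classify_users and the category names are hypothetical, and comparing emails and usernames case-insensitively is an assumption that should be checked against how the user service treats them.

```python
from typing import Dict, List


def classify_users(source_users: List[dict], target_users: List[dict]) -> Dict[str, List[dict]]:
    """Group source users by how they relate to the users in the target."""
    # Assumption: emails and usernames are compared case-insensitively.
    by_email = {u["email"].lower(): u for u in target_users}
    usernames = {u["username"].lower() for u in target_users}

    groups: Dict[str, List[dict]] = {
        "same_id": [],         # same email, same id: nothing to update
        "different_id": [],    # same email, different id: ids must be remapped
        "username_clash": [],  # different email, same username: needs a decision
        "unmatched": [],       # not known to the target at all
    }
    for user in source_users:
        match = by_email.get(user["email"].lower())
        if match is not None:
            key = "same_id" if match["_id"] == user["_id"] else "different_id"
        elif user["username"].lower() in usernames:
            key = "username_clash"
        else:
            key = "unmatched"
        groups[key].append(user)
    return groups
```

Running such a classification on fresh dumps of both databases would give an up-to-date count for each category before any ids are changed.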

The tool is designed to connect to two databases and change the source database in place, so that it can be prepared for merging into the target. Unlike slidewiki-cli, the tool does not require a connection to live databases or to any microservice: it can work offline, using MongoDB dumps of both the source and target databases loaded into some MongoDB instance that is not in actual use. The idea is that, after the data in the source database has been manipulated, it can be dumped and restored to the target database so that the records are appended to the existing collections. This allows for the following type of workflow:

  1. create a dump of the data on slidewiki.org for reference
  2. move stable into maintenance mode, and create a dump of its data
  3. apply the process that updates ids and references in collections using the data dumps of stable and slidewiki.org → result will be updated stable data dump
  4. restore updated data dump on stable
  5. do some more testing on stable, at the same time letting people know that any edits they perform might not persist
  6. bring slidewiki.org into maintenance mode so that the counters can be updated, and the stable data can be merged

User copying from stable to slidewiki.org

As almost all data in stable contain references to users, we need to decide on how to copy that data while at the same time retaining user information. Comparing data dumps of stable and slidewiki.org on 22/1/2018 reveals the following:

  • of the 2887 users on stable:
    • 2395 users are registered in slidewiki.org with the same email
      • of those, 2331 users share the same id as well, so any references to those users in stable data need no updating
      • the remaining 64 occur across databases with different ids, so those should be updated in stable data to have the same ids as in slidewiki.org
    • of the 492 users that are not matched with any user in slidewiki.org using the email:
      • 25 users occur across databases with the same username
  • data on stable only references 345 users (content, metadata, activities, etc)
    • of them 264 are not registered with slidewiki.org
      • 21 occur across databases with the same username

The tools include a command, matchusers, that updates the ids of those users in the source (stable) that have matching emails in the target (slidewiki.org) but different ids (64 in the above breakdown), and also updates the references to them. During this process, a user in stable may be assigned the id of a user in slidewiki.org that is also the id of some other user in stable. For this reason, before issuing the matchusers command we would first shift all user ids by an offset, using the shiftids command.
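The following sketch shows why shifting before matching avoids such clashes. It is a simplified illustration, not the matchusers implementation: the function name, the user reference field, and the sample data are hypothetical. In the example, a stable user with id 5 matches a slidewiki.org user with id 7, while stable also has an unrelated user with id 7.

```python
from typing import Dict, List


def shift_then_match(source_users: List[dict], target_users: List[dict],
                     references: List[dict], offset: int) -> Dict[int, int]:
    """Shift all source user ids, then give matched users their target id."""
    # Step 1: move every source id (and every reference) out of the target range.
    for user in source_users:
        user["_id"] += offset
    for ref in references:
        ref["user"] += offset

    # Step 2: users with a matching email take the id they have in the target.
    target_ids = {u["email"]: u["_id"] for u in target_users}
    remap = {u["_id"]: target_ids[u["email"]]
             for u in source_users if u["email"] in target_ids}
    for user in source_users:
        user["_id"] = remap.get(user["_id"], user["_id"])
    for ref in references:
        ref["user"] = remap.get(ref["user"], ref["user"])
    return remap


stable_users = [{"_id": 5, "email": "a@example.org"},
                {"_id": 7, "email": "b@example.org"}]
target_users = [{"_id": 7, "email": "a@example.org"}]
deck_refs = [{"deck": 1, "user": 5}, {"deck": 2, "user": 7}]

remap = shift_then_match(stable_users, target_users, deck_refs, offset=20000)
```

Without the initial shift, remapping user 5 to id 7 would make the two stable users indistinguishable in every collection that references them.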

After this process is complete, there are still going to be 492 users in stable that are not registered in slidewiki.org. For these users we may decide to:

  1. copy them to slidewiki.org:
    1. we keep them in the stable database dump we are going to load on slidewiki.org
    2. there are still going to be 25 users in slidewiki.org with duplicate usernames, so we could patch them e.g. by appending a '_' character to one of them until we have no duplicates
    3. we should re-check for duplicate emails or usernames, as some user on stable may have registered on slidewiki.org while we were preparing the data dump from stable → there are scripts already for that in the user-service
  2. do not copy them to slidewiki.org:
    1. it is not a requirement to migrate any credentials from stable to slidewiki
    2. we should somehow handle the references to those users from the other collections on stable; "Handle deactivated users" has some specifications that may be useful, but nothing is implemented as of yet
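For the first option, the username patching mentioned above could be sketched as below. The function name is hypothetical and usernames are compared exactly as written; if the platform treats usernames case-insensitively, they would need to be normalised first.

```python
from typing import Set


def dedupe_username(username: str, taken: Set[str]) -> str:
    """Append '_' until the username is not in taken, then reserve it."""
    candidate = username
    while candidate in taken:
        candidate += "_"
    taken.add(candidate)  # later users must not reuse the patched name
    return candidate


taken_usernames = {"alice", "alice_"}   # usernames already on the target
patched = dedupe_username("alice", taken_usernames)
unchanged = dedupe_username("bob", taken_usernames)
```

Affected users would need to be informed of their new username, since it changes how they log in.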

Since most of the users referenced by data on stable are not registered in slidewiki.org (264 of 345), it would make sense, and may also be somewhat safer, to copy those users over to slidewiki.org.


Future of stable

Since the goal of this task is to get everyone (including partners) using slidewiki.org, we also need to decide what to do about stable.slidewiki.org. Whatever the decision, it only makes sense that data on stable would not generally be retained, as is already the case with data on experimental or testing.

Alternatives are:

  • Keep stable as a mirror of slidewiki.org
  • Retire stable altogether, and use the machine hosting it in some other capacity:
    • we could use current stable machine to host either testing or experimental going forward and cut down on cloud host costs

Action items

  • If simple copy is going to be used, create a script/process that organises such a copy process (deck ids, users, etc)
  • If full copy is going to be used, decide what to do about migrating the users on stable whose email is not registered on slidewiki.org, and the data that references them
  • Script a process for copying uploaded files from source (stable.slidewiki.org) to target (slidewiki.org) file systems, that also handles conflicts
  • Script a process that checks content for URLs to source host (in this case, 'stable.slidewiki.org') and updates them to target (slidewiki.org)
  • Decide on stable re-purposing after the merging is complete