Data Migration
This document describes how to migrate cloud-specific absolute paths stored in the database to either relative paths or the absolute paths of a new cloud service provider (CSP). It covers two cases:
Replacing existing absolute paths in the database with relative paths.
Migrating data when changing the CSP.
Example: moving from Azure to AWS.
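The two rewrites can be sketched in Python as below. This is a minimal illustration, not the 'csp-migrator' implementation: the function names, the example base URLs and the assumption that a stored URL begins with a known old base URL are all hypothetical.

```python
from typing import Sequence

# Hypothetical old (Azure-style) base URL; real values come from deployment config.
OLD_BASE_URLS = ["https://oldaccount.blob.core.windows.net/content-container"]


def to_relative_path(url: str, old_base_urls: Sequence[str] = OLD_BASE_URLS) -> str:
    """Strip a known old base URL, leaving a relative path.

    URLs that do not start with any known base are returned unchanged.
    """
    for base in old_base_urls:
        prefix = base.rstrip("/") + "/"
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def to_new_csp_url(url: str, new_base_url: str,
                   old_base_urls: Sequence[str] = OLD_BASE_URLS) -> str:
    """Re-root an old-CSP absolute URL under the new CSP base URL."""
    relative = to_relative_path(url, old_base_urls)
    if relative == url:
        return url  # not a recognised old-CSP URL; leave it untouched
    return new_base_url.rstrip("/") + "/" + relative
```

Leaving unrecognised URLs unchanged keeps the rewrite safe to re-run on data that has already been migrated.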
The data below currently (up to release-5.2.0) stores cloud-specific absolute URLs, which are to be migrated to relative paths or to the new CSP's absolute URLs:
Neo4j fields based on objectType:
Note: The above data is available as the configuration neo4j_fields_to_migrate in the 'csp-migrator' job.
Cassandra data that will be migrated (keyspace.table: columns)
hierarchy_keystore.content_hierarchy: hierarchy
content_keystore.content_data: body
content_keystore.question_data: body, editorState, answer, solutions, instructions, media
questionset_keystore.questionset_hierarchy: hierarchy
dialcodes.dialcode_images: url
dialcodes.dialcode_batch: url
ECAR (needs live nodes republishing)
streamingUrl (needs regeneration based on new Media service provider)
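A column-wise rewrite of the Cassandra data could look like the sketch below. It assumes the list above reads as keyspace, table, columns, and that each listed column holds text (for example serialized JSON) with the old base URL embedded in it. The helper names and the example row are hypothetical, and no Cassandra driver is involved: a row is modelled as a plain dict.

```python
from typing import Dict, List, Sequence

# Tables and columns from the list above, keyed as "keyspace.table".
CASSANDRA_COLUMNS_TO_MIGRATE: Dict[str, List[str]] = {
    "hierarchy_keystore.content_hierarchy": ["hierarchy"],
    "content_keystore.content_data": ["body"],
    "content_keystore.question_data": [
        "body", "editorState", "answer", "solutions", "instructions", "media",
    ],
    "questionset_keystore.questionset_hierarchy": ["hierarchy"],
    "dialcodes.dialcode_images": ["url"],
    "dialcodes.dialcode_batch": ["url"],
}


def rewrite_text(value: str, old_base_urls: Sequence[str], replacement: str) -> str:
    """Replace every occurrence of an old base URL prefix inside a text value.

    Pass replacement="" to get relative paths, or a new base URL ending in "/"
    to get new-CSP absolute URLs.
    """
    for base in old_base_urls:
        value = value.replace(base.rstrip("/") + "/", replacement)
    return value


def rewrite_row(table: str, row: dict, old_base_urls: Sequence[str],
                replacement: str) -> dict:
    """Return a copy of the row with only the configured columns rewritten."""
    migrated = dict(row)
    for column in CASSANDRA_COLUMNS_TO_MIGRATE.get(table, []):
        if isinstance(migrated.get(column), str):
            migrated[column] = rewrite_text(migrated[column], old_base_urls, replacement)
    return migrated
```

Columns that are not listed for a table are never touched, so identifiers and other metadata pass through unchanged.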
Reference diagram: how existing data is migrated with CNAME (storing relative paths in the database).
Flink Jobs used for migration:
csp-migrator: For migration of data in Neo4j and Cassandra tables.
live-node-publisher: For republishing of live nodes (Content and Collection).
live-video-stream-generator: For regeneration of streamingUrl using the new Media service.
cassandra-data-migration: For column-wise migration of data in any Cassandra table.
Note:
The Jenkins job 'Neo4jElasticSearchSyncTool' is used to insert events into the input topic of the 'csp-migrator' job. The 'csp-migrator' job in turn inserts events into the input topics of the 'live-node-publisher' and 'live-video-stream-generator' jobs, based on conditions. Jenkins job command: migratecspdata
The 'cassandra-data-migration' job is to be used for migrating the 'dialcode_images' and 'dialcode_batch' Cassandra tables in the 'dialcodes' keyspace.
Run the migration Flink jobs in a separate Kafka setup with increased processing capacity and enough storage for all Kafka events and logs.
Increase the infrastructure for Neo4j. Also increase the Neo4j max heap size in the Neo4j conf file.
Increase the infrastructure for Logstash, the search-indexer Flink job and ElasticSearch so that they can handle the Neo4j transaction data sync.
The content migration must be executed only in the order given below. Otherwise the migration may fail because dependent content has not yet been migrated.
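The hand-off from 'csp-migrator' to the two downstream jobs might be pictured as in the sketch below. The actual conditions are not spelled out in this document, so the rules used here are assumptions drawn from the job descriptions: live Content and Collection nodes go to 'live-node-publisher', and video mime types go to 'live-video-stream-generator'. The node field names and the function name are likewise hypothetical.

```python
from typing import List

# Assumed conditions, inferred from the job descriptions rather than from the job code.
VIDEO_MIME_TYPES = {"video/mp4", "video/webm"}
REPUBLISHABLE_OBJECT_TYPES = {"Content", "Collection"}


def downstream_jobs(node: dict) -> List[str]:
    """Return the downstream jobs a migrated node would be forwarded to."""
    jobs = []
    # Assumption: only nodes in Live status need their ECAR republished.
    if node.get("status") == "Live" and node.get("objectType") in REPUBLISHABLE_OBJECT_TYPES:
        jobs.append("live-node-publisher")
    # Assumption: any video asset or content needs its streamingUrl regenerated.
    if node.get("mimeType") in VIDEO_MIME_TYPES:
        jobs.append("live-video-stream-generator")
    return jobs
```

For example, a live video Content node would be forwarded to both jobs, while a draft PDF asset would be forwarded to neither.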
Migration Steps
Before running the migration steps, run all the queries on the linked page and keep the output for comparison after the migration.
Go to the Deploy/KnowledgePlatform/Neo4jElasticSearchSyncTool Jenkins job.
Select migratecspdata as the command.
Copy and paste the parameters below, one at a time and in the order listed, into the parameter section of the Jenkins deployment job.
1. Video Asset
--graphId domain --objectType Asset --mimeType video/webm,video/mp4 --delay 2000
2. Other Asset
--graphId domain --objectType Asset --delay 2000
3. Video Content
--graphId domain --objectType Content,ContentImage --mimeType video/mp4,video/webm --delay 2000
4. Plugin, Youtube Content, PDF Content, EPUB Content
--graphId domain --objectType Content,ContentImage --mimeType application/vnd.ekstep.plugin-archive,video/x-youtube,application/pdf,application/epub --delay 2000
5. AssessmentItem
--graphId domain --objectType AssessmentItem --delay 2000
6. ItemSet
--graphId domain --objectType ItemSet --delay 2000
7. H5P Content
--graphId domain --objectType Content,ContentImage --mimeType application/vnd.ekstep.h5p-archive --delay 2000
8. HTML
--graphId domain --objectType Content,ContentImage --mimeType application/vnd.ekstep.html-archive --delay 2000
9. ECML
--graphId domain --objectType Content,ContentImage --mimeType application/vnd.ekstep.ecml-archive --delay 2000
10. Remaining Contents
--graphId domain --objectType Content,ContentImage --delay 2000
11. Collection
--graphId domain --objectType Content,ContentImage,Collection,CollectionImage --mimeType application/vnd.ekstep.content-collection --delay 2000
12. dialcodes.dialcode_images
Push below event to "{{env}}.cassandra.data.migration.request" kafka topic
13. dialcodes.dialcode_batch
Push below event to "{{env}}.cassandra.data.migration.request" kafka topic
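To avoid copy and paste mistakes, the parameter strings for steps 1 to 11 can be generated from an ordered list, as in the sketch below. The flag names and values are taken from the steps above. The MigrationStep type and the build_parameters helper are hypothetical conveniences and not part of the sync tool. Steps 12 and 13 are left out because they are Kafka events rather than Jenkins parameters.

```python
from typing import List, NamedTuple, Optional


class MigrationStep(NamedTuple):
    name: str
    object_types: List[str]
    mime_types: Optional[List[str]] = None


CONTENT = ["Content", "ContentImage"]

# Order matters: later steps depend on earlier ones having been migrated.
STEPS = [
    MigrationStep("Video Asset", ["Asset"], ["video/webm", "video/mp4"]),
    MigrationStep("Other Asset", ["Asset"]),
    MigrationStep("Video Content", CONTENT, ["video/mp4", "video/webm"]),
    MigrationStep("Plugin, Youtube Content, PDF Content, EPUB Content", CONTENT,
                  ["application/vnd.ekstep.plugin-archive", "video/x-youtube",
                   "application/pdf", "application/epub"]),
    MigrationStep("AssessmentItem", ["AssessmentItem"]),
    MigrationStep("ItemSet", ["ItemSet"]),
    MigrationStep("H5P Content", CONTENT, ["application/vnd.ekstep.h5p-archive"]),
    MigrationStep("HTML", CONTENT, ["application/vnd.ekstep.html-archive"]),
    MigrationStep("ECML", CONTENT, ["application/vnd.ekstep.ecml-archive"]),
    MigrationStep("Remaining Contents", CONTENT),
    MigrationStep("Collection",
                  ["Content", "ContentImage", "Collection", "CollectionImage"],
                  ["application/vnd.ekstep.content-collection"]),
]


def build_parameters(step: MigrationStep, graph_id: str = "domain",
                     delay: int = 2000) -> str:
    """Build the parameter string to paste into the Jenkins job for one step."""
    parts = ["--graphId", graph_id, "--objectType", ",".join(step.object_types)]
    if step.mime_types:
        parts += ["--mimeType", ",".join(step.mime_types)]
    parts += ["--delay", str(delay)]
    return " ".join(parts)


if __name__ == "__main__":
    for number, step in enumerate(STEPS, start=1):
        print(f"{number}. {step.name}: {build_parameters(step)}")
```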
Note:
If the 'migratecspdata' command stops before reaching 100%, wait for the 'csp-migrator' job lag to reach 0 before triggering the same 'migratecspdata' command again.
ECML migration can be triggered only after the Asset, AssessmentItem and ItemSet migrations are completed.
For Collection migration to be triggered, the prerequisites are:
a. Question and QuestionSet migration must be completed (as part of Inquiry BB).
b. All assets and contents must be migrated successfully.
c. All migrated content data must be synced to ElasticSearch.
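These ordering rules can be written down as a small pre-flight check, sketched below. The prerequisite sets mirror the notes above. The labels used for completed stages, the lag value passed in and the function names are hypothetical: how completion and Kafka consumer lag are actually measured is left to the operator.

```python
from typing import Dict, Iterable, List, Set

# Stages that must be complete before the keyed migration may be triggered.
PREREQUISITES: Dict[str, Set[str]] = {
    "ECML": {"Asset", "AssessmentItem", "ItemSet"},
    "Collection": {"Question", "QuestionSet", "Asset", "Content", "ElasticSearch sync"},
}


def missing_prerequisites(target: str, completed: Iterable[str]) -> List[str]:
    """Return the prerequisite stages of `target` that are not yet completed."""
    return sorted(PREREQUISITES.get(target, set()) - set(completed))


def can_retrigger(csp_migrator_lag: int) -> bool:
    """The same migratecspdata command may be re-run only once the lag is 0."""
    return csp_migrator_lag == 0
```

For instance, `missing_prerequisites("ECML", ["Asset"])` returns `['AssessmentItem', 'ItemSet']`, so the ECML step should wait.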
Migration status: migrationVersion of the node object
1. Neo4j & Cassandra data migration started
no version => Data migration has been started for the Neo4j node, but no result is recorded yet: the Neo4j and Cassandra data migration for the object is still in progress or did not complete.
1.0 => Success: Neo4j and Cassandra data migration completed for the object (node).
0.1 => Fail: Data migration failed for the Neo4j data or the Cassandra data of the specific node. The identifier is the key for finding which node failed; check the service logs for the reason.
0.5 => Skipped: Migration was skipped for the object.
2. ECAR generation (after the Neo4j & Cassandra data migration in the previous step succeeds)
1.1 => Success: Neo4j and Cassandra data migration completed for the object and the ECAR is republished.
0.2 => Fail: Neo4j and Cassandra data migration completed for the object, but the ECAR republish failed.
3. Video streaming generation (after the ECAR generation in the previous step succeeds)
1.2 => Success: Neo4j and Cassandra data migration completed for the video asset/content and streamingUrl is regenerated successfully.
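When checking nodes after a run, the migrationVersion values listed above can be decoded with a small lookup, sketched below. The version values and their meanings come from the list. The function name and the wording used for a missing version are assumptions.

```python
from typing import Union

# migrationVersion values and meanings, as listed above.
MIGRATION_STATUS = {
    "1.0": "Success: Neo4j and Cassandra data migrated",
    "0.1": "Fail: Neo4j or Cassandra data migration failed",
    "0.5": "Skipped: migration skipped for the object",
    "1.1": "Success: data migrated and ECAR republished",
    "0.2": "Fail: data migrated but ECAR republish failed",
    "1.2": "Success: data migrated and streamingUrl regenerated",
}


def describe_migration_version(version: Union[str, float, None]) -> str:
    """Translate a node's migrationVersion into a readable status."""
    if version is None:
        # Assumed reading of "no version": started, but no result recorded.
        return "no version: migration started, no result recorded"
    key = f"{float(version):.1f}"  # normalise 1, 1.0 and "1.0" to "1.0"
    return MIGRATION_STATUS.get(key, f"unknown migrationVersion {key}")
```

Normalising to one decimal place lets the lookup accept the value whether the node stores it as a number or as a string.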
Verification of migration steps:
More details of the verification steps are documented in the Confluence wiki linked below.
CNAME URL Configuration:
The 'cloudstorage_base_path' private/public repo variable is to be updated with the CNAME URL only.
The 'valid_cloudstorage_base_urls' private/public repo variable is to be updated with the CNAME URL along with the cloud storage blob URL.
The 'dial_service_schema_base_path' private repo variable is to be updated with the CNAME URL if it currently holds a blob URL.
The 'sunbird_cloud_storage_urls' public/private devops repo variable is to be updated by adding the CNAME URL.
Restart all services and jobs (Services: Assessment service, Content Service, Taxonomy Service, Learning Service, DIAL Service. Jobs: asset-enrichment, content-auto-creator, content-publish, qrcode-image-generator, search-indexer)
Run the sync tool with the 'syncbyobjectType' command to sync all assets, contents and collections from Neo4j to ElasticSearch with CNAME URLs.
If CNAME is configured after the migration, republishing of all Live contents and collections must be triggered so that the ECARs are updated with CNAME URLs.
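How the two base-URL variables fit together can be sketched as below. The variable names come from the configuration above, but the behaviour shown is an assumption: stored relative paths are resolved against 'cloudstorage_base_path', and a URL is accepted if it starts with any entry in 'valid_cloudstorage_base_urls'. The example URLs and function names are hypothetical.

```python
from typing import Sequence

# Hypothetical values: the CNAME URL, and the CNAME URL plus the blob URL.
CLOUDSTORAGE_BASE_PATH = "https://cdn.example.org"
VALID_CLOUDSTORAGE_BASE_URLS = [
    "https://cdn.example.org",
    "https://account.blob.core.windows.net",
]


def resolve(relative_path: str, base_path: str = CLOUDSTORAGE_BASE_PATH) -> str:
    """Turn a stored relative path into an absolute URL under the CNAME."""
    return base_path.rstrip("/") + "/" + relative_path.lstrip("/")


def is_valid_storage_url(url: str,
                         valid_base_urls: Sequence[str] = VALID_CLOUDSTORAGE_BASE_URLS) -> bool:
    """Accept a URL only if it sits under one of the configured base URLs."""
    return any(url.startswith(base.rstrip("/") + "/") for base in valid_base_urls)
```

Keeping the blob URL in the valid list, as the configuration above does, means URLs that still carry the blob host are accepted alongside the CNAME ones.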