w3resource

Versioned Datasets Management System with Python

Write a Python program that creates a system for managing versioned datasets with Git-like semantics.

The task is to build a system that manages versioned datasets with Git-like functionality. Users create commits, each capturing a snapshot of the dataset together with a commit message and a timestamp. They can list all commits, view the details of each commit, and roll the dataset back to any previous version. Tracking changes this way makes it easy to restore an earlier state when needed.

Sample Solution:

Python Code:

# Import necessary modules
import os
import shutil
import datetime

# Define the DatasetManager class
class DatasetManager:
    # Initialize the DatasetManager instance with the given dataset path
    def __init__(self, dataset_path):
        # Set the dataset path
        self.dataset_path = dataset_path
        # Set the metadata path inside the dataset
        self.dataset_metadata_path = os.path.join(dataset_path, ".metadata")
        # Initialize the current version to 0
        self.current_version = 0
        # Initialize the dataset
        self.initialize_dataset()

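    # Method to initialize the dataset
    def initialize_dataset(self):
        # A dataset counts as new if it has no metadata directory yet
        is_new = not os.path.exists(self.dataset_metadata_path)
        # Create the dataset and metadata directories if they are missing
        os.makedirs(self.dataset_metadata_path, exist_ok=True)
        if is_new:
            # Create the initial commit
            self.create_commit("Initial commit")
        else:
            # Resume from the highest version already recorded on disk
            versions = [int(e) for e in os.listdir(self.dataset_metadata_path) if e.isdigit()]
            self.current_version = max(versions, default=0)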
    
    # Method to create a new commit with a message
    def create_commit(self, message):
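        # Make sure the metadata directory exists before scanning it
        os.makedirs(self.dataset_metadata_path, exist_ok=True)
        # Number the new commit after the highest existing one, so a commit made
        # after a rollback never reuses an existing commit directory
        versions = [int(e) for e in os.listdir(self.dataset_metadata_path) if e.isdigit()]
        self.current_version = max(versions, default=0) + 1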
        # Create a directory for the new commit
        commit_dir = os.path.join(self.dataset_metadata_path, str(self.current_version))
        # Make the commit directory
        os.makedirs(commit_dir)
        # Write the commit message to a file
        with open(os.path.join(commit_dir, "message.txt"), "w") as f:
            f.write(message)
        # Get the current timestamp
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        # Write the timestamp to a file
        with open(os.path.join(commit_dir, "timestamp.txt"), "w") as f:
            f.write(timestamp)
        # Take a snapshot of the dataset
        self.snapshot_dataset(commit_dir)
    
    # Method to snapshot the dataset
    def snapshot_dataset(self, commit_dir):
        # Store the snapshot inside the commit directory
        snapshot_dir = os.path.join(commit_dir, "snapshot")
        # Copy the dataset files, skipping anything named ".metadata" so the
        # commit history is never copied into itself
        shutil.copytree(self.dataset_path, snapshot_dir,
                        ignore=shutil.ignore_patterns(".metadata"))

    # Method to get the current version of the dataset
    def get_current_version(self):
        # Return the current version
        return self.current_version

    # Method to rollback to a specific version
    def rollback(self, version):
        # Check if the version number is valid
        if version <= 0 or version > self.current_version:
            # Print an error message if the version number is invalid
            print("Invalid version number")
            return
        # Define the path to the snapshot stored with that commit
        snapshot_dir = os.path.join(self.dataset_metadata_path, str(version), "snapshot")
        # Check if the snapshot does not exist
        if not os.path.exists(snapshot_dir):
            # Print an error message if the version does not exist
            print("Version {} does not exist".format(version))
            return
        # Remove the current dataset files, but keep the metadata directory
        for entry in os.listdir(self.dataset_path):
            if entry == ".metadata":
                continue
            path = os.path.join(self.dataset_path, entry)
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        # Copy the snapshot back into the dataset directory (Python 3.8+)
        shutil.copytree(snapshot_dir, self.dataset_path, dirs_exist_ok=True)
        # Set the current version to the rollback version
        self.current_version = version
    
    # Method to list all commits
    def list_commits(self):
        # Initialize an empty list to store commits
        commits = []
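        # Collect the commit directories (os.listdir returns them in arbitrary order)
        entries = [e for e in os.listdir(self.dataset_metadata_path) if e.isdigit()]
        # Iterate over the commits in ascending version order
        for entry in sorted(entries, key=int):
            # Define the path to the commit
            commit_path = os.path.join(self.dataset_metadata_path, entry)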
            # Read the commit message from the file
            with open(os.path.join(commit_path, "message.txt"), "r") as f:
                message = f.read().strip()
            # Read the timestamp from the file
            with open(os.path.join(commit_path, "timestamp.txt"), "r") as f:
                timestamp = f.read().strip()
            # Append the commit details to the list
            commits.append((entry, message, timestamp))
        # Return the list of commits
        return commits

# Example usage
if __name__ == "__main__":
    # Create an instance of DatasetManager with the dataset path "dataset1"
    dataset_manager = DatasetManager("dataset1")
    # Create a new commit with the message "Add initial data"
    dataset_manager.create_commit("Add initial data")
    # Create another commit with the message "Update data"
    dataset_manager.create_commit("Update data")
    # Print the current version of the dataset
    print("Current version:", dataset_manager.get_current_version())
    # Print the list of commits
    print("Listing commits:")
    for commit in dataset_manager.list_commits():
        print(commit)
    # Rollback to version 1
    dataset_manager.rollback(1)
    # Print the current version after rollback
    print("After rollback, current version:", dataset_manager.get_current_version())

Output:

Current version: 3
Listing commits:
('1', 'Initial commit', '2024-05-21 11:30:05')
('2', 'Add initial data', '2024-05-21 11:30:05')
('3', 'Update data', '2024-05-21 11:30:05')
After rollback, current version: 1
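
The sample run reports version numbers only, not file contents. As a minimal, self-contained sketch of what a rollback has to do with the files themselves, the snippet below writes two versions of a file, snapshots each, and restores the first. The helpers take_snapshot, restore_snapshot, and read_text are hypothetical names used for illustration and are not part of DatasetManager; keeping the snapshots in a directory outside the dataset is an assumption of this sketch.

```python
import os
import shutil
import tempfile


def take_snapshot(dataset_dir, snapshot_dir):
    """Copy everything in dataset_dir into a new snapshot_dir."""
    shutil.copytree(dataset_dir, snapshot_dir)


def restore_snapshot(snapshot_dir, dataset_dir):
    """Replace the contents of dataset_dir with those of snapshot_dir."""
    shutil.rmtree(dataset_dir)
    shutil.copytree(snapshot_dir, dataset_dir)


def read_text(path):
    """Return the full text of a file."""
    with open(path) as f:
        return f.read()


def demo():
    # Work in a throwaway directory so nothing is left behind on disk
    with tempfile.TemporaryDirectory() as root:
        dataset = os.path.join(root, "data")
        snapshots = os.path.join(root, "snapshots")  # kept outside the dataset
        os.makedirs(dataset)
        data_file = os.path.join(dataset, "values.csv")

        # Version 1 of the data
        with open(data_file, "w") as f:
            f.write("a,b\n1,2\n")
        take_snapshot(dataset, os.path.join(snapshots, "1"))

        # Version 2 overwrites the file
        with open(data_file, "w") as f:
            f.write("a,b\n3,4\n")
        take_snapshot(dataset, os.path.join(snapshots, "2"))

        # Read the file before and after restoring version 1
        before = read_text(data_file)
        restore_snapshot(os.path.join(snapshots, "1"), dataset)
        after = read_text(data_file)
        return before, after


before, after = demo()
print(repr(before))
print(repr(after))
```

Storing snapshots outside the directory being copied avoids a snapshot ever containing earlier snapshots, which is the main design point this sketch is meant to show.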

Explanation:

  • Import Modules: Necessary modules (os, shutil, datetime) are imported.
  • Define DatasetManager Class: A class for managing versioned datasets.
  • Initialize Class (__init__ Method): Sets up dataset paths and initializes dataset.
  • Initialize Dataset Method: Creates dataset and metadata directories if they don't exist, and makes an initial commit.
  • Create Commit Method: Increments version, creates a commit directory, writes a message and timestamp, and snapshots the dataset.
  • Snapshot Dataset Method: Copies the current dataset to a snapshot directory.
  • Get Current Version Method: Returns the current version number.
  • Rollback Method: Reverts the dataset to a specified version, with checks for valid version numbers.
  • List Commits Method: Lists all commits by reading messages and timestamps from the metadata directory.
  • Example Usage: Demonstrates creating a 'DatasetManager' instance, making commits, printing the current version, listing commits, and rolling back to a previous version.
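
The task statement also asks for viewing the details of a single commit, which the class does not expose as a separate method. The following is a minimal sketch of one way to do it, assuming the on-disk layout written by create_commit (one directory per version containing message.txt and timestamp.txt). The function read_commit is a hypothetical helper, not part of the original solution, and the metadata directory is built by hand here only to keep the example self-contained.

```python
import os
import tempfile


def read_commit(metadata_path, version):
    """Return one commit's details as a dict, or None if it does not exist."""
    commit_dir = os.path.join(metadata_path, str(version))
    if not os.path.isdir(commit_dir):
        return None
    details = {"version": int(version)}
    # Each field is stored in its own text file inside the commit directory
    for field in ("message", "timestamp"):
        with open(os.path.join(commit_dir, field + ".txt")) as f:
            details[field] = f.read().strip()
    return details


# Build a tiny metadata directory by hand to try the helper out
with tempfile.TemporaryDirectory() as metadata_path:
    commit_dir = os.path.join(metadata_path, "1")
    os.makedirs(commit_dir)
    with open(os.path.join(commit_dir, "message.txt"), "w") as f:
        f.write("Initial commit")
    with open(os.path.join(commit_dir, "timestamp.txt"), "w") as f:
        f.write("2024-05-21 11:30:05")

    found = read_commit(metadata_path, 1)    # existing commit
    missing = read_commit(metadata_path, 2)  # no such commit

print(found)
print(missing)
```

Returning None for a missing commit, rather than raising, mirrors the way rollback reports an unknown version without stopping the program; a stricter variant could raise an exception instead.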
