CONCURRENCY CONTROL FROM CENTRALIZED TO DISTRIBUTED DBMS

CENTRALIZED TO DISTRIBUTED DBMS

            Compared to a centralized database system, a distributed database system is much more complex to implement, and so is its concurrency control. We had to augment the simulation model used in the centralized simulator with the flexibility to simulate the most important aspects of a distributed database system. We introduced three new concepts to the distributed simulation model:

1. Number of sites that the database is distributed to.
2. Locality of data.
3. Type of network.

             In this model we have only one global transaction manager object. This object is responsible for creating transactions and for issuing their operations to the underlying scheduler and data manager objects.

TRANSACTION MANAGER

            The Transaction Manager (TM) is responsible for simulating the transactions that generate operations to the schedulers. It is implemented as one module that holds information about all currently active transactions. Every transaction is assigned to one of the nodes in the network when it is created. In this way, although we only have one TM object, we simulate that every node has its own transaction manager and that every transaction has one initiating node.

            In the simulation model we use the number of concurrent transactions as the main criterion for the load on the system. In every simulation we have a fixed number of active transactions, called the multiprogramming level (MPL); this is a good approximation of a system with constant load. Every time a transaction finishes, a new one is started (after some random delay).
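            As a minimal sketch of this mechanism, the code below keeps a fixed number of active transactions and assigns each new transaction to an initiating node. The class name TransactionManager, the parameters mpl and num_nodes, and the restart delay distribution are illustrative assumptions, not taken from the actual simulator.

```python
import random

# Illustrative sketch: a transaction manager that keeps a fixed number of
# active transactions (the multiprogramming level, MPL) and assigns each
# new transaction to an initiating node.
class TransactionManager:
    def __init__(self, mpl, num_nodes, seed=0):
        self.mpl = mpl                      # constant number of active transactions
        self.num_nodes = num_nodes
        self.rng = random.Random(seed)
        self.next_tid = 0
        self.active = {}                    # tid -> initiating node

    def start_transaction(self):
        tid = self.next_tid
        self.next_tid += 1
        node = self.rng.randrange(self.num_nodes)   # pick an initiating node
        self.active[tid] = node
        return tid

    def finish_transaction(self, tid):
        # When a transaction finishes, a replacement is started after a
        # (simulated) random delay, so the load stays constant.
        self.active.pop(tid)
        delay = self.rng.uniform(0.0, 0.1)          # placeholder restart delay
        return delay, self.start_transaction()

tm = TransactionManager(mpl=50, num_nodes=4)
for _ in range(tm.mpl):                             # fill the system up to the MPL
    tm.start_transaction()
```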

 ADDRESS SPACE

            The database contains a fixed number of data elements. Each data element is addressed by a number in the interval [0, max]. The data elements are de-clustered over all the nodes in the network. How the data elements are assigned to the nodes is specified in the simulator’s configuration file.
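            The sketch below shows one way such an assignment could look; the round-robin mapping in home_node is only an assumed stand-in for whatever placement the configuration file actually specifies.

```python
# Illustrative sketch: de-clustering the data elements [0, max] over the nodes.
# The real simulator reads the assignment from its configuration file; here a
# simple round-robin mapping stands in for that configuration.
def home_node(element, num_nodes):
    return element % num_nodes

MAX_ELEMENT = 9999          # the database addresses the interval [0, max]
NUM_NODES = 4

# Elements stored on node 2 under the assumed round-robin scheme.
local_to_node_2 = [e for e in range(MAX_ELEMENT + 1) if home_node(e, NUM_NODES) == 2]
```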

 OPERATIONS

            When a transaction is created it is given “a life of its own”, and starts to generate operations to the schedulers. Each operation is a Read, Write, Commit or Abort operation. If the operation is a Read or Write, a data element has to be accessed. One of the main advantages of a distributed database is the possibility of storing data at the node that accesses it most frequently. Therefore, in most cases the transaction selects a data element stored on the same node where the transaction originated. This locality of the data accesses can be configured individually for each node. Every time an operation finishes successfully, the transaction, after a short delay, either generates a new operation or decides to end by sending a commit or abort operation.
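            A minimal sketch of this operation generation is shown below; the operation mix weights, the locality probability of 0.8, and the round-robin placement assumed when picking an element are illustrative values, not the simulator's actual parameters.

```python
import random

rng = random.Random(1)

# Illustrative sketch: generating the next operation of a transaction with a
# configurable probability of accessing data local to the initiating node.
def next_operation(initiating_node, num_nodes, max_element, locality=0.8):
    kind = rng.choices(["read", "write", "commit", "abort"],
                       weights=[0.5, 0.4, 0.08, 0.02])[0]
    if kind in ("commit", "abort"):
        return kind, None
    if rng.random() < locality:
        node = initiating_node                  # local access (most cases)
    else:
        node = rng.randrange(num_nodes)         # occasional remote access
    # Pick an element whose home node is `node`, assuming round-robin placement.
    element = rng.randrange(max_element // num_nodes) * num_nodes + node
    return kind, element

print(next_operation(initiating_node=2, num_nodes=4, max_element=10000))
```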

             The previous implementations simulated a mix of long and short transactions: every time a new transaction was created there was a certain probability that it would be a long one. The problem with long transactions is that they stay in the system for a long time and occupy more resources. This causes frequent conflicts with other transactions and often results in the long transactions being aborted. In heavily loaded systems, long transactions very often get aborted several times before they manage to finish. Results from our simulations with the centralized version of the simulator showed that, even if the system is started with only a few long transactions, this aborting and restarting of long transactions means that by the end of the simulation there are far more active long transactions than active short transactions.

             To avoid this situation, we have changed the simulation model so that a fixed percentage of all active transactions is long. This makes it easier to balance the resource usage between long and short transactions.
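            A small sketch of how such a fixed share of long transactions could be maintained is given below; the function name and the default long_fraction of 20% are assumptions made for illustration only.

```python
# Illustrative sketch: choosing the length class of a replacement transaction
# so that a fixed percentage of the active transactions stays long.
def next_length_class(active_long, active_total, long_fraction=0.2):
    # Start a long transaction only if the share of long transactions among
    # the active ones has dropped below the configured fraction.
    if active_total == 0 or active_long / active_total < long_fraction:
        return "long"
    return "short"

print(next_length_class(active_long=8, active_total=50))   # 16% long -> start a long one
```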

SCHEDULER

            The scheduler module in this system is a black box: it communicates with the other modules only via a well-defined interface. This makes it easy to replace it with schedulers that use other techniques. In the distributed version of the simulator, two-phase locking and timestamp ordering schedulers are implemented.
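            The following sketch illustrates what such a well-defined interface might look like; the method names schedule and operation_done are assumed for illustration and are not the simulator's actual interface.

```python
from abc import ABC, abstractmethod

# Illustrative sketch of a scheduler interface: the transaction manager only
# calls schedule(), and the data manager reports completion via operation_done(),
# so concrete schedulers (2PL, timestamp ordering, ...) can be swapped freely.
class Scheduler(ABC):
    def __init__(self, data_manager):
        self.data_manager = data_manager

    @abstractmethod
    def schedule(self, transaction_id, operation, element):
        """Called by the transaction manager with a read/write/commit/abort."""

    @abstractmethod
    def operation_done(self, transaction_id, operation, element):
        """Callback from the data manager when an operation has completed."""
```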

            The scheduler receives an operation from the transaction manager and processes it according to its scheduling technique. When the scheduler decides that an operation can be sent to the data manager, the data manager is called. When the operation has completed, the scheduler is notified by a call from the data manager. The following distributed schedulers are implemented:

1. Strict Two-Phase Locking Scheduler:

A nice feature of non-replicated distributed databases is that each local scheduler can schedule the data accesses as if it were a centralized scheduler. But of course there are also problems that are more difficult to solve in the distributed context than for a centralized scheduler. For the 2PL scheduler the main problem that has to be solved is deadlock.

2. Strict Timestamp Ordering Scheduler:

We have implemented a strict TO scheduler. Although deadlocks are not a problem here, we have another “global problem”: assigning monotonically increasing, unique timestamps to transactions. In a real implementation this could be done by concatenating a local timestamp counter with the node number.
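The sketch below illustrates the suggested scheme; the choice of 8 bits for the node number and the class name TimestampGenerator are assumptions made for the example.

```python
import itertools

# Illustrative sketch: globally unique, monotonically increasing timestamps
# built by concatenating a local counter with the node number. The node number
# occupies the low bits, so ties between nodes are broken deterministically.
NODE_BITS = 8                     # assumed upper bound of 256 nodes

class TimestampGenerator:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = itertools.count(1)   # local timestamp counter

    def next_timestamp(self):
        return (next(self.counter) << NODE_BITS) | self.node_id

ts_node_0 = TimestampGenerator(0)
ts_node_3 = TimestampGenerator(3)
assert ts_node_0.next_timestamp() != ts_node_3.next_timestamp()
```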

DATA MANAGER

            The data manager simulates the physical database storage manager. Each node has a data manager, which operates independently of the other data managers in the system. We have used a very simple and straightforward model in which the data manager consists only of a simulated buffer cache and a disk.

            Operations against the data manager are simulated by time delays. The time needed to execute a disk operation is the sum of the time the operation has to wait for the disk to become idle (the time needed to execute previous operations submitted to the disk) and the time it takes to read or write the page. We have chosen a very simple disk-scheduling algorithm: first-come-first-served (first-in-first-out). A read operation from the cache takes constant time and can be served immediately.
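            A minimal sketch of this timing model is given below; the constants CACHE_READ_TIME and DISK_IO_TIME are placeholder values, not the parameters used in the actual simulator.

```python
# Illustrative sketch of the data manager's timing model: cache reads take a
# constant time, while disk operations are served first-come-first-served, so
# an operation waits until the disk has finished everything queued before it.
CACHE_READ_TIME = 0.0001    # assumed constant cache access time (seconds)
DISK_IO_TIME = 0.01         # assumed time to read or write one page

class DataManager:
    def __init__(self):
        self.disk_free_at = 0.0          # simulated time when the disk becomes idle
        self.cache = set()               # pages currently in the buffer cache

    def access(self, now, page, is_write):
        if not is_write and page in self.cache:
            return CACHE_READ_TIME                    # served directly from the cache
        start = max(now, self.disk_free_at)           # wait for earlier requests (FIFO)
        self.disk_free_at = start + DISK_IO_TIME
        self.cache.add(page)
        return (start - now) + DISK_IO_TIME           # waiting time + I/O time

dm = DataManager()
print(dm.access(now=0.0, page=42, is_write=False))    # cold read goes to the disk
print(dm.access(now=0.0, page=42, is_write=False))    # second read hits the cache
```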

            Logging is not implemented separately; the time and resources needed for logging are considered to be included in the time needed for the write operations. However, when an abort occurs, we write back the before images.
