THROUGHPUT VS NUMBER OF NODES AND MPL
Distributed DBMSs are used both to increase availability and to increase performance. Performance should increase with an increasing number of nodes. We have simulated with 1, 2, 4, 10, 20 and 50 nodes, and for each of these node configurations with an MPL of 1, 5, 10, 100, 500 and 1000.
Here we assume that all nodes hold an equal share of the data in the database and that the transaction load is balanced, meaning that on average the same number of transactions is started on each node. We also assume that the probability of the accessed data being located at the transaction's home node is 80%. If not on the home node, the data is located on a remote node, each remote node being equally likely. The network model used in this simulation is a LAN.
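As a minimal sketch of this placement model (the function and parameter names are ours, not the simulator's; the 80% figure and the uniform choice among remote nodes are taken from the text above):

```python
import random

def access_node(home: int, n_nodes: int, p_home: float = 0.80) -> int:
    """Pick the node holding the data item for one access.

    With probability p_home the item is on the transaction's home
    node; otherwise it is on one of the remote nodes, each equally
    likely, as assumed in the simulation setup.
    """
    if n_nodes == 1 or random.random() < p_home:
        return home
    # Choose uniformly among the n_nodes - 1 remote nodes,
    # skipping the home node.
    remote = random.randrange(n_nodes - 1)
    return remote if remote < home else remote + 1
```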
The throughput increases with the number of nodes for both the timestamp ordering (TO) and the two-phase locking (2PL) scheduler.
A more interesting result is that the throughput of the TO scheduler is higher than that of the 2PL scheduler. This was not expected. In the comparison done with the centralized version of the simulator, the 2PL scheduler performed better than the TO scheduler in most cases. In a real implementation of a distributed scheduler this result would have been more likely, because 2PL carries an overhead for global deadlock detection. In our implementation, however, the global deadlock check is done without imposing any delay on the participating transactions.
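What that check amounts to is a cycle search over the global wait-for graph. A minimal sketch, assuming a simple adjacency-set representation (the simulator's actual data structures are not described here; in a real system the edges would first have to be collected over the network):

```python
def has_deadlock(waits_for: dict[int, set[int]]) -> bool:
    """Detect a cycle in the global wait-for graph.

    waits_for maps a transaction id to the set of transactions
    it is currently blocked behind. A cycle means a global
    deadlock, and some victim must be aborted to break it.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in waits_for}

    def visit(t: int) -> bool:
        color[t] = GREY
        for u in waits_for.get(t, ()):
            if color.get(u, WHITE) == GREY:   # back edge: cycle found
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in waits_for)
```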
For the TO scheduler, we note that the curves for an MPL of 500 and an MPL of 1000 are almost identical. Here we have reached the maximum throughput attainable with the available resources, the main bottleneck being the disks.
For the 2PL scheduler, the maximum throughput is reached at a much lower load: the figure shows that it occurs near 100 active transactions.
For 500 and 1000 active transactions, the throughput is much lower. With only one transaction in the system, the throughput is expected to decrease when nodes are added, and the figures confirm this: with a single active transaction, the curve dips as soon as more than one node is introduced. The reason is that some of the accessed data now has to travel over the network.
Another observation from these figures is that with more nodes and a high-capacity network between them, the database achieves a substantial performance gain over a centralized database. The main reason is the higher aggregate disk bandwidth provided by more disks. This, of course, requires that data is accessed evenly and that most accesses hit the local disk without crossing the network. Given these characteristics, a distributed database built from rather inexpensive off-the-shelf nodes can compete very well with more expensive parallel database machines.
Response Times
We present the response times for the two schedulers. The results are taken from the same simulations as those presented in the previous subsection. Only the short transactions have been used in making these curves, since it is usually the small transactions that have the most stringent response time requirements.
Comparing the two figures, we reach the same conclusion as in the previous subsection: the TO scheduler performs better than the 2PL scheduler for short transactions.
We are now able to answer the question of why the TO scheduler has higher throughput than the 2PL scheduler. The length of a transaction strongly influences the abort probability. For short transactions, the abort probability is rather small; for long transactions, it is not. Especially with a high number of parallel transactions, we get many conflicting operations and many aborts. What is important to notice is that TO produces more aborts than the 2PL scheduler. This is as expected, since TO does not handle conflicts as gracefully as 2PL: 2PL is more likely to let a transaction wait for the resource, while TO aborts more often when a conflicting operation occurs.
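This difference follows directly from the basic TO rules, under which a conflict with a younger transaction resolves to an abort rather than to the wait a 2PL lock manager would impose. A minimal sketch of those rules (the field and function names are ours; this is textbook basic TO, not the simulator's code):

```python
class Item:
    """Per-item timestamps kept by a basic TO scheduler."""
    def __init__(self) -> None:
        self.read_ts = 0    # largest timestamp that has read the item
        self.write_ts = 0   # largest timestamp that has written it

def to_write(item: Item, ts: int) -> str:
    """Basic TO write rule: abort instead of waiting on conflict."""
    if ts < item.read_ts or ts < item.write_ts:
        return "abort"      # a younger transaction got there first
    item.write_ts = ts
    return "proceed"

def to_read(item: Item, ts: int) -> str:
    """Basic TO read rule."""
    if ts < item.write_ts:
        return "abort"      # the value was written by a younger transaction
    item.read_ts = max(item.read_ts, ts)
    return "proceed"
```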
Then why does the TO scheduler nevertheless perform so much better than the 2PL scheduler? The main reason is that we have a mix of long and short transactions; the long transactions are why 2PL performs worse than TO. There are two reasons for this:
1. With many active transactions there will be many conflicts, and long transactions in the system will lock many data elements. 2PL lets a short transaction wait for a long transaction to finish, so many short transactions remain blocked for a long time. This is the main reason why the response time for short transactions is so much longer with the 2PL scheduler than with the TO scheduler. It is also the reason why 2PL reaches its maximum throughput earlier than TO.
2. Another effect of locking is that long transactions have a higher probability of finishing without being aborted. This means that on average more resources are occupied for a long time by long transactions under the 2PL scheduler than under the TO scheduler. From the simulation logs we noticed that the percentage of finished transactions that were long was much higher for 2PL than for TO.
A conclusion that can be drawn from this is that 2PL treats long transactions more fairly than the TO scheduler; the cost is a lower maximum throughput. To check this, we ran the same simulations with only short transactions, and this time both schedulers showed almost comparable performance, as expected.
For a mixed load, i.e. long and short transactions, we keep the load ratio constant at 20% long and 80% short transactions. However, the production ratio, i.e. the mix among completed transactions, changes with increasing total load: because long transactions are more vulnerable to data contention under the TO scheduler than short ones, the relative production of long transactions decreases with increasing MPL.
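To see why the production ratio drifts even though the load ratio is fixed, consider the completed mix as a function of per-class abort probabilities. The sketch below is illustrative only: the abort probabilities are made-up parameters, not output from our simulator.

```python
def production_ratio(p_long: float, abort_long: float, abort_short: float) -> float:
    """Fraction of completed transactions that are long, given the
    started mix and hypothetical per-class abort probabilities."""
    done_long = p_long * (1.0 - abort_long)
    done_short = (1.0 - p_long) * (1.0 - abort_short)
    return done_long / (done_long + done_short)

# Example: 20% long transactions started. If at high MPL a scheduler
# aborted 90% of long attempts but only 10% of short ones, only about
# 3% of completions would be long transactions.
print(production_ratio(0.20, 0.90, 0.10))   # ~0.027
```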
Throughput with Different Data Placement
It is also important to measure throughput when we change the probability of accessing data on the home node. Here we ran both schedulers in a configuration with 10 nodes and an MPL of 1, 20 and 1000, over a LAN network.
First, we examined how the throughput changes as the home node probability ranges from 0% to 100%. As expected, the curves show that the more data accesses go to local data, the better the system performs.
For the TO simulations, the curves for an MPL of 1 and an MPL of 20 are easy to explain. For an MPL of 1, the throughput increases only slightly with increasing home node probability, because more and more of the operations can be performed locally without crossing the network. For an MPL of 20, increasing the home node probability has a strong effect: the system goes from a state where the network is the bottleneck to one where the network contributes very little to the transaction delay.
The curve for an MPL of 1000 might look a bit strange. At 0% home node probability the network is the bottleneck, because we simulate it as a shared-medium LAN. At 10% home node probability, the bottleneck has already shifted from the network to the disks. From Figure 8 we see that the results for the 2PL simulation are similar.
CONCLUSION
The results of the comparison between the timestamp ordering scheduler and the
two-phase locking scheduler can be summarized as follows:
a. With a mix of long and short transactions, the TO scheduler has a higher throughput than the 2PL scheduler.
b. With only short transactions, the two schedulers perform almost identically.
c. The TO scheduler has much higher abort probabilities than the 2PL scheduler.
d. The 2PL scheduler favors long transactions, and the number of long transactions that manage to finish successfully is much higher for the 2PL scheduler.
e. The network is not the bottleneck under a reasonable load; only a heavy load combined with a slow network severely affects performance.
f. 2PL, the standard technique used for centralized DBMSs, proves to perform rather poorly for distributed systems, whereas timestamp ordering based protocols in their various forms seem to provide the best overall performance.
- In 2PL, and in other locking techniques as well, deadlock prevention or detection is required, which is much more complex and costly in a distributed environment.
- Timestamp ordering techniques (TO) avoid deadlocks entirely.
- Basic TO (BTO) usually shows better overall performance in a distributed environment than 2PL.
- ORDER outperforms both 2PL and BTO given low network latency and an efficient implementation of the total ordering algorithm. For high network latencies, ORDER appears to be a rather disadvantageous approach.
- PREDICT shows basically the same advantages as ORDER.