An Overview of Distributed Database Systems: Architecture, Replication, Fragmentation, and Client-Server Computing

Distributed Database Systems

A distributed database system is a collection of multiple, interconnected databases that are spread across different locations and communicate with each other over a network. The goal of a distributed database system is to make it appear as if all the data is stored in a single location, even though it is actually distributed across multiple physical locations.


Importance of Distributed Database Systems

Distributed database systems are becoming increasingly important as more and more organizations require access to large amounts of data that is distributed across multiple locations. They can improve performance, scalability, reliability, and data availability, and can also help organizations to better manage and utilize their data resources.


Characteristics of Distributed Database Systems

Data Transparency

Distributed database systems provide data transparency, which means that the users and applications are unaware of the physical location of the data. They can access the data as if it were stored in a single location.

Concurrent Access

Distributed database systems allow multiple users to access the same data at the same time, without interfering with each other.

Distribution Independence

Distributed database systems are designed to be distribution-independent, which means that the data can be distributed across different locations without affecting the way it is accessed or used.

Failure Transparency

Distributed database systems are designed to be failure-transparent, which means that they can continue to operate even if one or more of the databases or network connections fail.


Advantages of Distributed Database Systems

Improved Performance

Distributed database systems can improve performance by distributing the workload across multiple databases and servers. This can help to reduce the amount of data that needs to be stored in a single location and can also help to reduce the amount of time it takes to access and process data.

Increased Scalability

Distributed database systems can increase scalability by allowing organizations to add more servers and databases as their data needs grow. This can help organizations to better manage and utilize their data resources, and can also help to reduce the amount of downtime that is caused by system overloads.

Improved Reliability

Distributed database systems can improve reliability by distributing the data across multiple locations and by providing built-in redundancy and failover capabilities. This can help to reduce the amount of downtime that is caused by system failures and can also help to protect against data loss.

Increased Flexibility

Distributed database systems can increase flexibility by allowing organizations to store and access data in different locations, for example keeping each region's data close to the users who work with it. This makes it easier to adapt the system as data needs and usage patterns change.

Improved Data Availability

Distributed database systems can improve data availability by keeping copies of the data at multiple locations, so that the data remains accessible even when one site or network link fails.


Disadvantages of Distributed Database Systems

Increased Complexity

Distributed database systems can be complex to set up, configure and maintain, which can increase the cost of ownership.

Increased Maintenance

Distributed database systems require more maintenance than centralized systems, due to the need to coordinate updates and ensure data consistency across multiple locations.

Increased Cost

The cost of a distributed database system is generally higher than that of a centralized system due to the need for additional hardware, software, and networking components.

Network Latency

Distributed database systems can be affected by network latency, which is the delay that occurs when data is transmitted over a network. This can be an issue in environments where the network connections between the databases are slow or unreliable, and can lead to slow performance and increased downtime.


Three different database system architectures

No database sharing architecture

In this type of architecture, each application has its own dedicated database that is not shared with any other applications. This type of architecture is often used in smaller organizations where there is only a need for a small amount of data and where there is no need for data sharing or collaboration between different applications or departments.

A networked architecture with a centralised database

In this type of architecture, all the data is stored in a centralised database that is accessed by multiple applications over a network. This type of architecture is often used in larger organizations where there is a need for data sharing and collaboration between different applications or departments. The centralised database is responsible for maintaining data consistency and integrity, and for providing access to the data for multiple applications.

A distributed database architecture

In this type of architecture, the data is distributed across multiple databases that are located in different locations and connected by a network. Each database is responsible for maintaining data consistency and integrity, and for providing access to the data for the applications that are connected to it. This type of architecture is often used in large organizations with a high volume of data, where there is a need for high availability, performance and scalability.


Types of Distributed Database Systems

Centralized Database Systems

A centralized database system is a traditional database system where all the data is stored in a single location. The data is accessed by users and applications over a network.

Partitioned Database Systems

A partitioned database system is a type of distributed database system in which the data is divided into smaller, more manageable pieces called partitions. These partitions can be stored on different servers or in different locations, and can be accessed independently of each other.

Replicated Database Systems

A replicated database system is a type of distributed database system in which the data is copied or replicated across multiple servers or locations. This provides built-in redundancy and failover capabilities, and can help to improve data availability and reduce downtime.


Key issues involving DDBMS

  • Data consistency: One of the key issues in a distributed database management system (DDBMS) is ensuring data consistency across all the databases that make up the system. This involves ensuring that all the databases have the same data and that any updates made to one database are reflected in all the other databases.
  • Data integrity: Maintaining data integrity in a DDBMS is important to ensure that the data is accurate, complete, and consistent. This involves implementing checks and constraints to ensure that data is entered correctly and that it meets certain standards.
  • Data replication: In a DDBMS, data is often replicated across multiple databases to ensure that it is available even if one or more databases fail. This can be a complex process, and requires careful management to ensure that the data is consistent and up-to-date across all the databases.
  • Concurrency control: In a DDBMS, multiple users can access the same data at the same time, which can lead to conflicts and inconsistencies. Concurrency control techniques are used to ensure that these conflicts are resolved in a consistent and predictable way.
  • Query optimization: In a DDBMS, it is important to optimize queries to ensure that they are executed efficiently and that the data is accessed quickly. This involves analyzing the query and selecting the most efficient method for accessing the data.
  • Security: A DDBMS has to secure data that is stored in different locations and transmitted over the network, which can be a complex task. Data privacy, access control, and data encryption are the key security issues.
  • Network communication: In a DDBMS, data is often transmitted over a network, which can introduce network latency and other communication-related issues. These issues need to be addressed to ensure that the data is transmitted quickly and reliably.
  • Scalability: A DDBMS should be able to handle large amounts of data and support a large number of users and transactions. This can be a challenge, as the system needs to scale to meet the growing needs of the organization.
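Several of these issues can be made concrete with small sketches. For concurrency control in particular, a lock-based scheme serializes conflicting updates to the same record. The sketch below is a minimal, illustrative stand-in (the LockManager class and record names are invented for the example, not part of any real DDBMS API):

```python
import threading

# Lock-based concurrency control sketch: each record gets its own lock,
# so two transactions updating the same record are serialized, while
# updates to different records can proceed independently.
class LockManager:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, record_id):
        # Create the record's lock on first use, under a guard lock.
        with self._guard:
            return self._locks.setdefault(record_id, threading.Lock())

store = {"acct:1": 100}
locks = LockManager()

def transfer(amount):
    # Acquire the record lock before the read-modify-write cycle.
    with locks.lock_for("acct:1"):
        store["acct:1"] += amount

threads = [threading.Thread(target=transfer, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store["acct:1"])  # 200 -- no update was lost
```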

Some Common Connection Structures of DDBMS

  1. Client-Server: A client-server connection structure is the most common in distributed database systems. In this structure, the clients are the applications that access the data, and the servers are the databases that store the data. The clients send requests to the servers, which then process the requests and send back the results.
  2. Peer-to-Peer: In a peer-to-peer connection structure, all the nodes in the network are equal and can act as both clients and servers. This type of structure is more decentralized and can be useful in cases where there is a need for data sharing and collaboration between different applications or departments.
  3. Ring: In a ring connection structure, the nodes are arranged in a ring shape, and data is passed from one node to another in a sequential manner. This type of structure is useful in situations where there is a need for data replication and failover capabilities.
  4. Mesh: A mesh connection structure is a network topology in which each node connects to an arbitrary number of neighbours. This type of structure is often used in situations where there is a need for high availability and performance.
  5. Tree: A tree connection structure is a network topology in which the nodes are organized hierarchically, with a root node as the parent and other nodes as children. This type of structure is often used in situations where there is a need for data aggregation and reporting.
  6. Hybrid: A hybrid connection structure is a combination of two or more of the above-mentioned structures. It is often used when there is a need for a combination of different functionalities, such as load balancing, data replication, and data distribution.

Distributed Transaction

A distributed transaction is a type of transaction that involves multiple databases or other resources that are distributed across different locations. The goal of a distributed transaction is to ensure that all the changes made to the different resources are committed or rolled back as a single unit, so that the data remains consistent and accurate.

A distributed transaction typically involves the following steps:

  1. The transaction begins by sending a request to start a new transaction to the transaction manager, which is a specialized software component that coordinates the distributed transaction.
  2. The transaction manager then sends requests to the different resources involved in the transaction, such as databases or other systems, to reserve or lock the necessary resources.
  3. The different resources then process the requests and return the results to the transaction manager.
  4. The transaction manager then evaluates the results and decides whether to commit or rollback the transaction.
  5. If the transaction manager decides to commit the transaction, it sends a commit request to all the resources involved in the transaction.
  6. If the transaction manager decides to roll back the transaction, it sends a rollback request to all the resources involved in the transaction, undoing any changes that were made.
  7. The transaction is then considered complete and the resources are released.

Distributed transactions can be complex to manage and require specialized software and infrastructure. However, they are important for ensuring data consistency and integrity in situations where data is distributed across multiple locations.
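The steps above correspond to the classic two-phase commit flow: a prepare (voting) phase followed by a commit or rollback phase. The following Python sketch is a toy illustration, assuming in-memory stand-ins for the transaction manager and the resources rather than a real DBMS API:

```python
# Toy two-phase commit: Resource objects stand in for databases; the
# run_transaction function plays the role of the transaction manager.
class Resource:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "idle"

    def prepare(self):
        # Phase 1: reserve/lock resources and vote yes or no.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit(self):
        self.state = "committed"      # phase 2: make changes permanent

    def rollback(self):
        self.state = "rolled_back"    # phase 2: undo any changes

def run_transaction(resources):
    # Commit only if every participant votes yes; otherwise roll back all.
    if all(r.prepare() for r in resources):
        for r in resources:
            r.commit()
        return "committed"
    for r in resources:
        r.rollback()
    return "rolled_back"

print(run_transaction([Resource("orders"), Resource("inventory")]))
# committed
print(run_transaction([Resource("orders"), Resource("billing", healthy=False)]))
# rolled_back
```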



Design of Distribution Database

Data Replication

Data replication is the process of copying data from one database to one or more other databases. The goal of data replication is to ensure that multiple copies of the same data are available in different locations, providing a way to improve data availability, reduce data loss, and improve performance.

There are several types of data replication:

  1. Master-slave replication: In this type of replication, one database is designated as the master and the other databases are designated as slaves. The master database receives all the updates and the slave databases replicate the updates from the master.
  2. Multi-master replication: In this type of replication, all the databases can act as both masters and slaves. Each database can receive updates and propagate them to other databases.
  3. Snapshot replication: In this type of replication, a snapshot of the data is taken at a specific point in time and replicated to other databases.
  4. Transactional replication: In this type of replication, the updates are replicated to other databases as soon as they are made to the master database.

Data replication can be performed in real time or periodically, and either synchronously or asynchronously. Synchronous replication means that the updates are applied to the other databases immediately after they are made to the master database; asynchronous replication means that the updates are applied at a later time.

Data replication is an important feature of distributed database systems and can help to improve data availability and reduce data loss in case of a failure. However, it can also increase the complexity of the system and the management overhead.
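To make the synchronous/asynchronous distinction concrete, here is a minimal master-slave sketch in Python. The Master and Replica classes are illustrative inventions, not a real replication API:

```python
import queue

# Sketch of master-slave replication. In synchronous mode each write is
# applied to every replica before the write returns; in asynchronous mode
# the update is queued and applied later.
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Master:
    def __init__(self, replicas, synchronous=True):
        self.data, self.replicas = {}, replicas
        self.synchronous = synchronous
        self.pending = queue.Queue()   # backlog for asynchronous mode

    def write(self, key, value):
        self.data[key] = value
        for r in self.replicas:
            if self.synchronous:
                r.apply(key, value)              # replica updated immediately
            else:
                self.pending.put((r, key, value))  # applied at a later time

    def flush(self):
        # Drain the backlog, bringing the replicas up to date.
        while not self.pending.empty():
            r, key, value = self.pending.get()
            r.apply(key, value)

replica = Replica()
m = Master([replica], synchronous=False)
m.write("x", 1)
print(replica.data)   # {} -- not yet applied
m.flush()
print(replica.data)   # {'x': 1}
```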

Advantages of data replication in a DDBMS

  1. Improved data availability: By replicating data to multiple locations, data can still be accessed even if one location is unavailable.
  2. Improved performance: By replicating data to multiple locations, the load on a single location can be reduced, which can improve performance.
  3. Improved data security: By replicating data to multiple locations, the risk of data loss due to a single point of failure is reduced.
  4. Improved disaster recovery: By replicating data to multiple locations, the replicas can be used for disaster recovery in case of a failure at the primary location.
  5. Better scalability: By replicating data across multiple servers, it allows for better scalability, which is essential for growing organizations.

Disadvantages of data replication in a DDBMS

  1. Increased complexity: Replication can increase the complexity of the system, making it harder to set up and maintain.
  2. Increased cost: Replication can increase the cost of the system, as it requires additional hardware and software.
  3. Increased network traffic: Replicating data across multiple locations can increase the network traffic and the load on the network.
  4. Data inconsistency: If the replicated data is not updated consistently, it can lead to data inconsistency, which can cause confusion and errors.
  5. Security risks: If the replicated data is not properly secured, it can be vulnerable to unauthorized access and data breaches.

Replication can be Classified

Complete replication and selective replication are two different types of data replication that can be used in a relational database.

Complete replication

In this type of replication, all the data from the primary database is replicated to the secondary databases. This means that all the tables, rows, and columns are replicated, regardless of whether they are needed or not. Complete replication is useful when all the data is needed at all locations and for disaster recovery purposes.

Selective replication

In this type of replication, only specific tables, rows, or columns are replicated to the secondary databases, based on the needs of the application or organization. Selective replication is useful when only certain data is needed at a specific location, and it can also help to reduce the amount of data that needs to be replicated, and the storage and network resources required.
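The two styles can be contrasted with a small sketch. The table names and the replicate helper below are hypothetical, chosen only to show complete versus selective copying:

```python
# Complete replication copies every table; selective replication copies
# only the tables a given site needs. All names here are illustrative.
primary = {
    "customers": [{"id": 1, "name": "Ada"}],
    "orders":    [{"id": 10, "customer_id": 1}],
    "audit_log": [{"event": "login"}],
}

def replicate(source, tables=None):
    # tables=None means complete replication; a list of names is selective.
    wanted = source if tables is None else {t: source[t] for t in tables}
    return {t: list(rows) for t, rows in wanted.items()}

full_site = replicate(primary)                       # complete replication
branch_site = replicate(primary, tables=["orders"])  # selective replication
print(sorted(full_site))    # ['audit_log', 'customers', 'orders']
print(sorted(branch_site))  # ['orders']
```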


Data fragmentation

Data fragmentation is the process of decomposing a relation (a table in a relational database) into smaller, non-overlapping component relations. The goal of data fragmentation is to improve the performance, scalability, and availability of the database.

There are several reasons why we might need to fragment a relation:

  1. Data distribution: By fragmenting a relation, the data can be distributed across multiple locations, which can help to improve performance and scalability by reducing the load on a single location.
  2. Data replication: By fragmenting a relation, the data can be replicated across multiple locations, which can help to improve data availability and reduce data loss in case of a failure.
  3. Improved query performance: By fragmenting a relation, it can be easier to locate specific data, which can help to improve query performance.
  4. Improved concurrency control: By fragmenting a relation, it can be easier to lock specific parts of the data, which can help to improve concurrency control and reduce contention.
  5. Improved security: By fragmenting a relation, it can be easier to implement security controls, such as access control, on specific parts of the data.
  6. Better scalability: By fragmenting a relation, it allows for better scalability, which is essential for growing organizations and large amounts of data.

Different Types of Fragmentation

There are several ways to carry out fragmentation in a relational database:

Horizontal fragmentation

This involves decomposing a relation into smaller relations based on the values of one or more attributes. For example, if we have a relation "customers" with the attributes "name," "address," "phone," and "email," we might fragment it into several relations based on the values of the "name" attribute. One relation might include all customers whose names start with "A-M," while another relation might include all customers whose names start with "N-Z".
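A minimal sketch of this split, using invented customer rows:

```python
# Horizontal fragmentation of the "customers" relation: rows are split
# by the first letter of the "name" attribute. The rows are illustrative.
customers = [
    {"name": "Alice", "address": "1 Elm St", "phone": "555-0101", "email": "a@x.com"},
    {"name": "Nadia", "address": "2 Oak St", "phone": "555-0102", "email": "n@x.com"},
    {"name": "Bob",   "address": "3 Ash St", "phone": "555-0103", "email": "b@x.com"},
]

frag_a_m = [row for row in customers if "A" <= row["name"][0] <= "M"]
frag_n_z = [row for row in customers if "N" <= row["name"][0] <= "Z"]

print([r["name"] for r in frag_a_m])  # ['Alice', 'Bob']
print([r["name"] for r in frag_n_z])  # ['Nadia']

# The original relation is the union of the (non-overlapping) fragments:
assert sorted(r["name"] for r in frag_a_m + frag_n_z) == \
       sorted(r["name"] for r in customers)
```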

Vertical fragmentation

This involves decomposing a relation into smaller relations based on its attributes. For example, if we have a relation "customers" with the attributes "name," "address," "phone," and "email," we might fragment it into several relations based on those attributes. One relation might include the attributes "name" and "address", while another relation might include the attributes "phone" and "email". Each fragment typically also keeps the primary key, so that the original relation can be reconstructed with a join.
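The same relation split vertically, as a sketch (the rows are invented for illustration, and "name" plays the role of the key kept in both fragments):

```python
# Vertical fragmentation of "customers": each fragment keeps the "name"
# attribute so the original rows can be rebuilt with a join.
customers = [
    {"name": "Alice", "address": "1 Elm St", "phone": "555-0101", "email": "a@x.com"},
    {"name": "Bob",   "address": "3 Ash St", "phone": "555-0103", "email": "b@x.com"},
]

frag1 = [{"name": r["name"], "address": r["address"]} for r in customers]
frag2 = [{"name": r["name"], "phone": r["phone"], "email": r["email"]}
         for r in customers]

# A join on "name" rebuilds the original relation:
rebuilt = [{**a, **b} for a in frag1 for b in frag2 if a["name"] == b["name"]]
print(rebuilt == customers)  # True
```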

Hybrid fragmentation

This involves combining horizontal and vertical fragmentation. For example, in the example above, we can fragment the relation into several relations both by the values of the "name" attribute and by its attributes.

Functional Dependency-Based fragmentation

This involves decomposing a relation based on the functional dependencies that exist between the attributes. For example, if we have a relation "orders" with the attributes "order_id," "customer_id," and "product_id," and the functional dependency "order_id -> customer_id" exists, we might fragment the relation into two relations: one relation including "order_id" and "customer_id" attributes, and another relation including "order_id" and "product_id" attributes.
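A sketch of this decomposition, with invented order rows, showing that joining the fragments on the shared key rebuilds the relation:

```python
# Functional-dependency-based split of "orders": the relation is decomposed
# into two projections that share the key "order_id". The rows are invented.
orders = [
    {"order_id": 1, "customer_id": 7, "product_id": 42},
    {"order_id": 2, "customer_id": 8, "product_id": 42},
]

order_customer = [{"order_id": r["order_id"], "customer_id": r["customer_id"]}
                  for r in orders]
order_product = [{"order_id": r["order_id"], "product_id": r["product_id"]}
                 for r in orders]

# The join on order_id is lossless because order_id -> customer_id holds:
rebuilt = [{**a, **b} for a in order_customer for b in order_product
           if a["order_id"] == b["order_id"]]
print(rebuilt == orders)  # True
```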


A step-wise distributed database design methodology

A step-wise methodology for distributed database design is as follows:

  1. Assess the distribution needs of the organization. Determine the location and level of distribution required for the database, such as at each branch office or regionally. This will impact the fragmentation method used.
  2. Develop a comprehensive global Entity-Relationship (ER) diagram, if necessary. Identify the relations from entities and their attributes.
  3. Analyze the key transactions in the system and determine where horizontal or vertical fragmentation may be beneficial.
  4. Identify the relations that should not be fragmented. These relations will be replicated across all locations.
  5. Examine the relations that are on the one-side of a relationship and determine an appropriate fragmentation schema for these relations. Relations on the many-side of a relationship may be suitable for derived fragmentation.
  6. Check for situations where mixed fragmentation may be needed, such as when transactions require access to a subset of the attributes of a relation.
  7. Set up the distributed architecture and implement the fragmentation schema, replication schema and security schema.
  8. Test and evaluate the performance of the distributed system and make any necessary adjustments.
  9. Continuously monitor and maintain the distributed system to ensure optimal performance and data consistency.

Client Server Database

A client-server database is a type of distributed database architecture where the database is stored on a central server, and clients (applications) access the database over a network. The clients send requests to the server, which then processes the requests and sends back the results.

Advantage

The main advantage of a client-server database is that it allows for centralized control and management of the data. The server is responsible for managing the data, enforcing security and access controls, and handling backups and other administrative tasks. This can help to improve data consistency and integrity.

In addition, a client-server database can be more scalable than a stand-alone database because the server can handle multiple clients and transactions simultaneously. This makes it well suited for environments where there are many concurrent users and high transaction volumes.

Disadvantage

The main disadvantage of a client-server database is that it can be more complex to set up and maintain than a stand-alone database. It also requires a reliable network connection to ensure that clients can access the data quickly and reliably.

Examples of client-server databases include MySQL, Microsoft SQL Server, and Oracle Database.
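The request/response cycle itself can be sketched with plain TCP sockets. This toy server answers lookups against an in-memory dictionary and merely stands in for a real database wire protocol:

```python
import socket
import threading

# Toy client-server exchange: the "server" holds a tiny key-value store
# and answers one GET-style request; the "client" sends a key and reads
# back the value. Keys and values are illustrative.
def serve(sock, data):
    conn, _ = sock.accept()
    with conn:
        key = conn.recv(1024).decode().strip()
        conn.sendall(str(data.get(key, "NOT FOUND")).encode())

server = socket.socket()
server.bind(("127.0.0.1", 0))          # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve, args=(server, {"answer": 42}))
t.start()

client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(b"answer")              # client sends a request...
reply = client.recv(1024).decode()     # ...and receives the result
client.close()
t.join()
server.close()
print(reply)   # 42
```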


Emergence of Client Server Architecture

Some of the early advancements in relational database technology allowed for the distribution of computing across multiple computers on a network by leveraging:

  1. Affordable, high-performance personal computers and servers
  2. User-friendly graphical interfaces
  3. Open-source systems
  4. Object-oriented programming concepts
  5. Collaborative workgroup computing
  6. Electronic data interchange and email
  7. Relational databases
  8. Networking and data communication technologies

This allowed for a more efficient and cost-effective way of handling and processing large amounts of data, which greatly improved the overall performance and scalability of the relational database systems. These advancements made it possible for businesses to take advantage of distributed computing and networking capabilities in their database systems and improve their overall data management capabilities.


Need for Client Server Computing

Client-server computing is a type of distributed computing architecture in which an application is divided into two parts: the client and the server. The client, which is typically a personal computer or workstation, handles the presentation of data and provides the user interface, while the server, which is typically a more powerful machine, handles the storage, management, and processing of data. Both the client and server are connected to the same network, allowing for communication and data transfer between them.

One of the key advantages of client-server computing is scalability. As the number of users increases, additional clients can be added to the network, and as the load on the database server increases, additional servers can be connected. This allows the system to grow and adapt as needed.

Another advantage is that client-server systems can be connected through a variety of networks, including both local area networks (LANs) and wide area networks (WANs) across multiple locations. Communication between the client and server is achieved through standard application program interfaces (APIs) and remote procedure calls (RPCs), and the language used for communication is typically SQL (Structured Query Language).


Structure of Client Server Systems

In a client-server system, clients represent the users who need services, while servers provide those services. Both the client and server are made up of a combination of hardware and software. The server is a separate logical object that communicates with clients over a network to perform tasks together. A client makes a request for a service and receives a reply, and the server receives and processes the request and sends back the required response.

There are two main types of client-server architecture: 2-Tier and 3-Tier.

2-Tier Client/Server Models

In a 2-Tier Client/Server model, the client and the server communicate directly with each other over a network. The client and the server both contain a combination of hardware and software. The client is responsible for the user interface and presentation of data, while the server is responsible for the storage, management, and processing of data.

One of the main advantages of 2-Tier Client/Server models is their simplicity. They are easy to set up and maintain, and they do not require as much infrastructure or complexity as 3-Tier models.

2-Tier Client/Server models are best suited for small-scale applications with a limited number of users and a small amount of data. They are also suitable for applications that have a low level of security requirements and do not require complex data processing capabilities.

The main disadvantage of 2-Tier Client/Server models is that they are not as scalable or secure as 3-Tier models. They are also less efficient, as the client and the server have to communicate with each other directly. Additionally, the client and the server are tightly coupled, making it harder to make changes or add new features without affecting the entire system.

3-tier Architecture

3-Tier architecture is a client-server architecture that separates the user interface, application logic and data management layers into separate components. This allows for better separation of concerns, improved scalability, and easier maintenance.

The three layers in a 3-tier architecture are:

  1. Presentation layer: This layer is responsible for the user interface and presentation of data. It handles input from the keyboard, mouse or other input devices and provides output in the form of screen displays.
  2. Application layer: This layer is responsible for the functionality provided to an application program. It contains the business logic and rules that drive the enterprise. This layer communicates with the data layer to retrieve and update data.
  3. Data layer: This layer is responsible for providing the generalized services needed by the other layers, including file services, print services, communications services and database services. It communicates with the underlying database to retrieve and update data.
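The three layers can be sketched as separate functions, with the presentation layer calling the application layer, which in turn calls the data layer. The function names and the tiny in-memory DATABASE are illustrative:

```python
# Minimal 3-tier sketch: each layer only talks to the one below it.
DATABASE = {"order:1": {"item": "widget", "qty": 3}}

def data_layer(key):
    # Data layer: generalized storage access (here, a dict lookup).
    return DATABASE.get(key)

def application_layer(order_id):
    # Application layer: business logic and rules.
    order = data_layer(f"order:{order_id}")
    if order is None:
        return "order not found"
    return f"{order['qty']} x {order['item']}"

def presentation_layer(order_id):
    # Presentation layer: formatting for the user interface.
    return f"Order {order_id}: {application_layer(order_id)}"

print(presentation_layer(1))  # Order 1: 3 x widget
```

Because each layer depends only on the interface of the layer below, any one of them can be replaced (for example, swapping the dict for a real database in the data layer) without touching the others.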

One of the main advantages of 3-tier architecture is that it is more scalable and secure than 2-tier architecture. It allows for better separation of concerns, making it easier to make changes or add new features without affecting the entire system.

Another advantage of 3-tier architecture is that it allows for the use of different technologies for each layer, making it easier to replace or update one layer without affecting the others.

The main disadvantage of 3-tier architecture is that it can be more complex to set up and maintain than 2-tier architecture. It also requires more infrastructure, such as middleware and additional servers.


What are the advantages of distributed databases over centralised databases?

Distributed databases offer several advantages over centralised databases, such as:

  • Improved scalability: Distributed databases can handle a larger amount of data and a higher number of users than centralised databases.
  • Improved availability: Distributed databases can provide access to data even if one location becomes unavailable.
  • Increased performance: Distributed databases can provide faster access to data by replicating data to multiple locations.
  • Improved security: Distributed databases can protect data from a single point of failure or a data loss event.
  • Improved data recovery: Distributed databases can use replicated data to recover lost data in case of a server failure.

What is the difference between global and local transactions?

A global transaction is a transaction that spans multiple databases, also known as a distributed transaction. It involves multiple resources and systems, and it is coordinated by a transaction manager. It ensures that all updates to the databases are made in a consistent and atomic way.

A local transaction is a transaction that is confined to a single database. It involves a single resource and system, and it is managed by the database management system. It ensures that all updates to the database are made in a consistent and atomic way.

In summary, the main difference between global and local transactions is the scope and coordination of the updates. Global transactions involve multiple resources and systems, and are coordinated by a transaction manager, while local transactions involve a single resource and system and are managed by the database management system.

What are the rules for fragmenting data?

When fragmenting data, there are several rules to consider:

  • Fragmentation should be based on a logical, non-overlapping division of data.
  • Fragmentation should be based on access patterns and queries in order to optimize performance.
  • Fragmented data should be stored on different sites to ensure data availability and reduce the impact of site failures.
  • Fragmented data should be replicated to ensure data consistency and recovery in case of a site failure.
  • Fragmented data should be protected by appropriate security measures to ensure data confidentiality and integrity.

It is also important to ensure that fragmentation does not impede the ability to make updates, join tables or perform other types of operations on the data.

What is Data Replication and what are its advantages and disadvantages?

Data replication is the process of copying data from one location to another, typically across multiple servers or databases. The goal of data replication is to increase data availability, performance and scalability, and to provide a backup of the data.

Advantages of data replication include:

  • Improved data availability: Replicating data across multiple locations can increase the availability of the data, as users can access the data even if one location becomes unavailable.
  • Increased performance: By replicating data to multiple locations, users can access the data from a location closer to them, which can improve the performance of the system.
  • Improved scalability: Replication allows for the distribution of data and workloads across multiple servers, which can improve the scalability of the system.
  • Improved data security: By replicating data to multiple locations, you can protect the data from a single point of failure or a data loss event.
  • Improved data recovery: If a server fails, the replicated data can be used to recover the lost data.