2 MB: Database

Showing posts with label Database. Show all posts

Friday, January 31, 2014

Frequently Asked Basic PostgreSQL Interview Questions and Answers

Frequently Asked Basic PostgreSQL Interview Questions and Answers

PostgreSQL is the widely used open source database. If you are preparing for PostgreSQL interview, following list of basic PostgreSQL interview questions and answers might help you in your interview preparation. Following PostgreSQL interview questions and answers cover PostgreSQL basic concepts like feature and advantages of PostgreSQL, key difference between MySQL and PostgreSQL, basic PostgreSQL database administration commands and tools, general PostgreSQL database concepts like Stored Procedures, Functions, Triggers, Cursor, Index, Joins, Subqueries etc.

1. What is PostgreSQL? What do you know about PostgreSQL?

PostgreSQL, often simply "Postgres", is an open-source object-relational database management system (ORDBMS) with an emphasis on extensibility and standards-compliance. It is released under the PostgreSQL License, a free/open source software license, similar to the MIT License. PostgreSQL is developed by the PostgreSQL Global Development Group, consisting of a handful of volunteers employed and supervised by companies such as Red Hat and EnterpriseDB.

Read more about PostgreSQL on Wikipedia and PostgreSQL official website

2. What are the various features and advantages of PostgreSQL?

This is very basic question and you should be updated on this. You should know why are you using PostgreSQL in your project, what features and advantages does PostgreSQL provide.

Visit official PostgreSQL website to learn more features and advantages of PostgreSQL

3. What are the key differences between MySQL and PostgreSQL? Which Open Source Database to Choose? Which one is best?

MySQL and PostgreSQL are both free and open source powerful and full-featured databases. You should be able to compare these two databases. Here is the complete article on this.

4. What are the various PostgreSQL database administration commands and tools?

You should know basic PostgreSQL database administration commands like creating users in PostgreSQL, setting up user credentials in PostgreSQL, change / update PostgreSQL user password, check whether PostgreSQL is up and running, commands to create, delete, drop, start, stop, restart, backup, restore PostgreSQL database, getting the list of all databases in PostgreSQL, finding out what version of PostgreSQL is running, PostgreSQL help and history commands, commands to get the list of all the tables in a PostgreSQL database, commands to turn on timing and checking how much time a query takes to execute, commands to see the list of available functions in PostgreSQL etc. Here is the complete article on this topic.

You should also know some of the PostgreSQL administration tools. You can visit Wiki and Stackoverflow to get to know various PostgreSQL administration tools.

5. PostgreSQL database general concepts

Beside all this you should be well aware of datatypes in PostgreSQL, DDL, DML, DCL commands used in PostgreSQL. You should have good knowledge of Indexes, Joins, Subqueries, Stored Procedures, Functions, Triggers, Cursors etc.

I hope you will get benefited by these basic PostgreSQL interview questions and answers.

16 PostgreSQL Database Administration Commands

16 PostgreSQL Database Administration Commands

Following are basic PostgreSQL database administration commands which each PostgreSQL database administrator should know. These PostgreSQL database administration commands include creating users in PostgreSQL, setting up user credentials in PostgreSQL, change / update PostgreSQL user password, check whether PostgreSQL is up and running, commands to create, delete, drop, start, stop, restart, backup, restore PostgreSQL database, getting the list of all databases in PostgreSQL, finding out what version of PostgreSQL is running, PostgreSQL help and history commands, commands to get the list of all the tables in a PostgreSQL database, commands to turn on timing and checking how much time a query takes to execute, commands to see the list of available functions in PostgreSQL etc. Lets have a look on following PostgreSQL Database Administration Commands.

1. How to change PostgreSQL root user password?

$ /usr/local/pgsql/bin/psql postgres postgres

Password: (oldpassword)

# ALTER USER postgres WITH PASSWORD 'tmppassword';

$ /usr/local/pgsql/bin/psql postgres postgres

Password: (tmppassword)

Changing the password for a normal postgres user is similar as changing the password of the root user. Root user can change the password of any user, and the normal users can only change their passwords as Unix way of doing.

# ALTER USER username WITH PASSWORD 'tmppassword';

2. How to setup PostgreSQL SysV startup script?

$ su - root

# tar xvfz postgresql-8.3.7.tar.gz

# cd postgresql-8.3.7

# cp contrib/start-scripts/linux /etc/rc.d/init.d/postgresql

# chmod a+x /etc/rc.d/init.d/postgresql

3. How to check whether PostgreSQL server is up and running?

$ /etc/init.d/postgresql status

Password:

pg_ctl: server is running (PID: 6171)

/usr/local/pgsql/bin/postgres "-D" "/usr/local/pgsql/data"

[Note: The status above indicates the server is up and running]

$ /etc/init.d/postgresql status

Password:

pg_ctl: no server running

[Note: The status above indicates the server is down]

4. How to start, stop and restart PostgreSQL database?

# service postgresql stop

Stopping PostgreSQL: server stopped

# service postgresql start

Starting PostgreSQL: ok

# service postgresql restart

Restarting PostgreSQL: server stopped

5. How do I find out what version of PostgreSQL I am running?

$ /usr/local/pgsql/bin/psql test

Welcome to psql 8.3.7, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms

\h for help with SQL commands

\? for help with psql commands

\g or terminate with semicolon to execute query

\q to quit

test=# select version();

version

----------------------------------------------------------------------------------------------------

PostgreSQL 8.3.7 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

(1 row)

test=#

6. How to create a PostgreSQL user?

There are two methods in which you can create user.

Method 1: Creating the user in the PSQL prompt, with CREATE USER command.

# CREATE USER ramesh WITH password 'tmppassword';

CREATE ROLE

Method 2: Creating the user in the shell prompt, with createuser command.

$ /usr/local/pgsql/bin/createuser sathiya

Shall the new role be a superuser? (y/n) n

Shall the new role be allowed to create databases? (y/n) n

Shall the new role be allowed to create more new roles? (y/n) n

CREATE ROLE

7. How to create a PostgreSQL Database?

There are two metods in which you can create two databases.

Method 1: Creating the database in the PSQL prompt, with createuser command.

# CREATE DATABASE mydb WITH OWNER ramesh;

CREATE DATABASE

Method 2: Creating the database in the shell prompt, with createdb command.

$ /usr/local/pgsql/bin/createdb mydb -O ramesh

CREATE DATABASE

* -O owner name is the option in the command line.

8. How do I get a list of databases in a Postgresql database?

# \l [Note: This is backslash followed by lower-case L]

List of databases

Name | Owner | Encoding

----------+----------+----------

backup | postgres | UTF8

mydb | ramesh | UTF8

postgres | postgres | UTF8

template0 | postgres | UTF8

template1 | postgres | UTF8

9. How to Delete/Drop an existing PostgreSQL database?

# \l

List of databases

Name | Owner | Encoding

----------+----------+----------

backup | postgres | UTF8

mydb | ramesh | UTF8

postgres | postgres | UTF8

template0 | postgres | UTF8

template1 | postgres | UTF8

# DROP DATABASE mydb;

DROP DATABASE

10. Getting help on postgreSQL commands

\? will show PSQL command prompt help. \h CREATE will shows help about all the commands that starts with CREATE, when you want something specific such as help for creating index, then you need to give CREATE INDEX.

# \?

# \h CREATE

# \h CREATE INDEX

11. How do I get a list of all the tables in a Postgresql database?

# \d

On an empty database, you’ll get “No relations found.” message for the above command.

12. How to turn on timing, and checking how much time a query takes to execute?

# \timing — After this if you execute a query it will show how much time it took for doing it.

# \timing

Timing is on.

# SELECT * from pg_catalog.pg_attribute ;

Time: 9.583 ms

13. How To Backup and Restore PostgreSQL Database and Table?

We discussed earlier how to backup and restore postgres database and tables using pg_dump and psql utility.

14. How to see the list of available functions in PostgreSQL?

To get to know more about the functions, say \df+

# \df

# \df+

15. How to edit PostgreSQL queries in your favorite editor?

# \e

\e will open the editor, where you can edit the queries and save it. By doing so the query will get executed.

16. Where can I find the PostgreSQL history file?

Similar to the Linux ~/.bash_history file, postgreSQL stores all the sql command that was executed in a history filed called ~/.psql_history as shown below.

$ cat ~/.psql_history

alter user postgres with password 'tmppassword';

\h alter user

select version();

create user ramesh with password 'tmppassword';

\timing

select * from pg_catalog.pg_attribute;

Friday, January 17, 2014

MongoDB vs Cassandra: Difference and Similarities between MongoDB and Cassandra

MongoDB vs Cassandra: Difference and Similarities between MongoDB and Cassandra

MongoDB and Cassandra have occupied a very big market of NoSQL Schema-Free databases. So, before going to choose any one of them, it is logical to compare basic features of both MongoDB and Cassandra. MongoDB and Cassandra both are fully-featured NoSQL databases and choosing one of them heavily depends upon your application requirements. There are a lot of similarities and differences between MongoDB and Cassandra. Following is the basic comparison of MongoDB and Cassandra in terms of data storage model, usage etc.

Differences between MongoDB and Cassandra

1. MongoDB has a document-oriented data model while Cassandra has column-oriented data model and storage type.

MongoDB acts much like a relational database. Its data model consists of a database at the top level, then collections which are like tables in MySQL (for example) and then documents which are contained within the collection, like rows in MySQL. Each document has a field and a value where this is similar to columns and values in MySQL. Fields can be simple key / value e.g. { 'name': 'David Mytton' } but they can also contain other documents e.g. { 'name': { 'first' : David, 'last' : 'Mytton' } }.

In Cassandra documents are known as “columns” which are really just a single key and value. e.g. { 'key': 'name', 'value': 'David Mytton' }. There’s also a timestamp field which is for internal replication and consistency. The value can be a single value but can also contain another “column”. These columns then exist within column families which order data based on a specific value in the columns, referenced by a key. At the top level there is a keyspace, which is similar to the MongoDB database.

2. MongoDB is developed by MongoDB, Inc while Cassandra is a product of Apache Software Foundation. The original authors of MongoDB are core contributors to the code and work for 10gen (indeed, 10gen was founded specifically to support MongoDB and the CEO and CTO are the original creators). In contrast, Cassandra was created by 2 engineers from Facebook and is incubated by the Apache Foundation. MongoDB was initially released in 2009 while Cassandra was released in 2008.

3. MongoDB is implemented in C++ while Cassandra is implemented in Java.

4. MongoDB supports Linux, OS X, Solaris and Windows server operating systems while Cassandra supports BSD, Linux, OS X and Windows server operating systems.

5. MongoDB supports more programming languages than Cassandra. MongoDB supports Actionscript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MatLab, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala and Smalltalk. Cassandra supports C#, C++, Clojure, Erlang, Go, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby and Scala.

6. Handling of indexes is different in MongoDB and Cassandra

MongoDB indexes work very similar to relational databases. You create single or compound indexes on the collection level and every document inserted into that collection has those fields indexed. Querying by index is extremely fast so long as you have all your indexes in memory.

Prior to Cassandra 0.7 it was essentially a key/value store so if you want to query by the contents of a key (i.e the value) then you need to create a separate column which references the other columns i.e. you create your own indexes. This changed in Cassandra 0.7 which allowed secondary indexes on column values, but only through the column families mechanism. Cassandra requires a lot more meta data for indexes and requires secondary indexes if you want to do range queries.

7. Handling of replication is different in MongoDB and Cassandra

In MongoDB replication is achieved through replica sets. This is an enhanced master/slave model where you have a set of nodes where one is the master. Data is replicated to all nodes so that if the master fails, another member will take over. There are configuration options to determine which nodes have priority and you can set options like sync delay to have nodes lag behind (for disaster recovery, for example).

Writes in MongoDB are “unsafe” by default; data isn’t written right away by default so it’s possible that a write operation could return success but be lost if the server fails before the data is flushed to disk. This is how Mongo attains high performance. If you need increased durability then you can specify a safe write which will guarantee the data is written to disk before returning. Further, you can require that the data also be successfully written to n replication slaves.

MongoDB drivers also support the ability to read from slaves. This can be done on a connection, database, collection or even query level and the drivers handle sending the right queries to the right slaves, but there is no guarantee of consistency (unless you are using the option to write to all slaves before returning). In contrast Cassandra queries go to every node and the most up to date column is returned (based on the timestamp value).

Cassandra has much more advanced support for replication by being aware of the network topology. The server can be set to use a specific consistency level to ensure that queries are replicated locally, or to remote data centres. This means you can let Cassandra handle redundancy across nodes where it is aware of which rack and data centre those nodes are on. Cassandra can also monitor nodes and route queries away from “slow” responding nodes.

The only disadvantage with Cassandra is that these settings are done on a node level with configuration files whereas MongoDB allows very granular ad-hoc control down the query level through driver options which can be called in code at run time.

8. MongoDB provides a custom map/reduce implementation while Cassandra provides native Hadoop support, including for Hive (a SQL data warehouse built on Hadoop map/reduce) and Pig (a Hadoop-specific analysis language that many think is a better fit for map/reduce workloads than SQL).

9. MongoDB supports server-side scripting while Cassandra does not.

10. Major users of MongoDB are Craigslist, Foursquare, Shutterfly, Intuit while Facebook, Twitter and Digg heavily use Cassandra.

11. MongoDB is commercial while Cassandra is free.

Similarities between MongoDB and Cassandra

1. Both MongoDB(AGPL) and Cassandra(Apache 2.0 license) are open-source databases.

2. Both MongoDB and Cassandra are NoSQL schema-free databases.

3. Both MongoDB and Cassandra support Sharding.

4. Both MongoDB and Cassandra support concurrency. Concurrency is supported by using Locks in MongoDB while concurrency in Cassandra is achieved by MVCC.

5. Both MongoDB and Cassandra support Eventual and Immediate Consistency.

6. None supports Foreign keys, Transaction concepts, Triggers etc.

Thursday, January 9, 2014

Types and Examples of NoSQL Databases

Types and Examples of NoSQL Databases

NoSQL databases are growing with very rapid speed because of their exciting features like more flexibility and scalability, schema-free architecture, easy replication support, simple API, consistent / BASE (not ACID), support for big data and more. Key-Values Stores, Column Family Stores, Document Databases, Graph Databases are the types of NoSQL Databases. MongoDB, CouchDB, CouchBase, Cassandra, HBase, Redis, Riak, Neo4J are the popular NoSQL databases. MongoDB, CouchDB, CouchBase are document-oriented NoSQL databases, Redis and Riak are key-value stores, Cassandra and HBase are column family stores and Neo4J is a graph database. Lets discuss these types of databases in detail.

1. Key-Values Stores

The main idea here is using a hash table where there is a unique key and a pointer to a particular item of data. The Key/Value model is the simplest and easiest to implement. But it is inefficient when you are only interested in querying or updating part of a value, among other disadvantages.

Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak

2. Column Family Stores

These were created to store and process very large amounts of data distributed over many machines. There are still keys but they point to multiple columns. The columns are arranged by column family.

Examples: Cassandra, HBase

3. Document Databases

These were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON. Document databases are essentially the next level of Key/value, allowing nested values associated with each key. Document databases support querying more efficiently.

Examples: CouchDB, MongoDb

4. Graph Databases

Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which, again, can scale across multiple machines. NoSQL databases do not provide a high-level declarative query language like SQL to avoid overtime in processing. Rather, querying these databases is data-model specific. Many of the NoSQL platforms allow for RESTful interfaces to the data, while other offer query APIs.

Examples: Neo4J, InfoGrid, Infinite Graph

28 MongoDB NoSQL Database Interview Questions and Answers

28 MongoDB NoSQL Database Interview Questions and Answers

MongoDB is the best free open source NoSQL document oriented database. If you are preparing for the technical interview on MongoDB NoSQL database, must prepare the following MongoDB NoSQL database interview questions. These MongoDB NoSQL database interview questions cover basic concepts of NoSQL databases, Replication, Sharding, Transactions and Locking, Profiler, Nuances, Journalling features etc. Lets have a look on following MongoDB NoSQL database interview questions:

1. What do you mean by NoSQL databases? What is the difference between NoSQL and RDBMS? Why and why not to use NoSQL databases? What are the various advantages of NoSQL databases?

I have written a complete blog post to answer above questions here.

2. What are the various types of NoSQL databases?

Types of NoSQL Databases

Examples: MongoDB, Cassandra, CouchDB, Hypertable, Redis, Riak, Neo4j, HBASE, Couchbase, MemcacheDB, RevenDB and Voldemort are the examples of NoSQL databases. Read in detail.

3. What is the basic difference between MySQL and MongoDB?

MySQL and MongoDB are both free and open source databases. MySQL and MongoDB have a lot of basic differences in terms of data representation, querying, relationships, transactions, schema design and definition, normalization, speed and performance. By comparing MySQL with MongoDB, we are comparing Relational and non-relational databases. Read in detail.

4. How do you compare MongoDB with CouchDB and CouchBase?

MongoDB and CouchDB are both document-oriented databases. MongoDB and CouchDB are the best examples of open source NoSQL database. Aside from both storing documents though, it turns out that they don't share much in common. There are a lot of difference between MongoDB and CouchDB in terms of implementation of their data-model, interface, object storage, replication methods etc.

For details, go through following links:

MongDB vs CouchDB

CouchDB vs CouchBase

5. What makes MongoDB best?

Following features of MongoDB make it best NoSQL database:

Document-oriented

High performance

High availability

Easy scalability

Rich query language

6. What is 32 bit nuances?

There is extra memory mapped file activity with journaling. This will further constrain the limited db size of 32 bit builds. Thus, for now journaling by default is disabled on 32 bit systems.

7. Will the journal replay have problems if entries are incomplete (like the failure happened in the middle of one)?

Each journal (group) write is consistent and won't be replayed during recovery unless it is complete.

8. What is role of Profiler in MongoDB?

MongoDB includes a database profiler which shows performance characteristics of each operation against the database. Using the profiler you can find queries (and write operations) which are slower than they should be; use this information, for example, to determine when an index is needed.

9. What's a "namespace"?

MongoDB stores BSON objects in collections. The concatenation of the database name and the collection name (with a period in between) is called a namespace.

10. If you remove an object attribute is it deleted from the store?

Yes, you remove the attribute and then re-save() the object.

11. Can I use the journaling feature to perform safe hot backups?

Yes

12. Are null values allowed?

For members of an object, yes. You cannot add null to a database collection though as null isn't an object. You can add {}, though.

13. Does an update fsync to disk immediately?

No, writes to disk are lazy by default. A write may hit disk a couple of seconds later. For example, if the database receives a thousand increments to an object within one second, it will only be flushed to disk once. (Note fsync options are available though both at the command line and via getLastError_old.)

14. How do I do transactions/locking?

MongoDB does not use traditional locking or complex transactions with rollback, as it is designed to be lightweight and fast and predictable in its performance. It can be thought of as analogous to the MySQL MyISAM autocommit model. By keeping transaction support extremely simple, performance is enhanced, especially in a system that may run across many servers.

15. Why are my data files so large?

MongoDB does aggressive preallocation of reserved space to avoid file system fragmentation.

16. How long does replica set failover take?

It may take 10-30 seconds for the primary to be declared down by the other members and a new primary elected. During this window of time, the cluster is down for "primary" operations – that is, writes and strong consistent reads. However, you may execute eventually consistent queries to secondaries at any time (in slaveOk mode), including during this window.

17. What's a master or primary?

This is a node/member which is currently the primary and processes all writes for the replica set. In a replica set, on a failover event, a different member can become primary.

18. What's a secondary or slave?

A secondary is a node/member which applies operations from the current primary. This is done by tailing the replication oplog (local.oplog.rs).

Replication from primary to secondary is asynchronous, however the secondary will try to stay as close to current as possible (often this is just a few milliseconds on a LAN).

19. Do I have to call getLastError to make a write durable?

No. If you don't call getLastError (aka "Safe Mode") the server does exactly the same behavior as if you had. The getLastError call simply lets one get confirmation that the write operation was successfully committed. Of course, often you will want that confirmation, but the safety of the write and its durability is independent.

20. Should I start out with sharded or with a non-sharded MongoDB environment?

We suggest starting unsharded for simplicity and quick startup unless your initial data set will not fit on single servers. Upgrading to sharding from unsharded is easy and seamless, so there is not a lot of advantage to setting up sharding before your data set is large.

21. How does sharding work with replication?

Each shard is a logical collection of partitioned data. The shard could consist of a single server or a cluster of replicas. We recommmend using a replica set for each shard.

22. When will data be on more than one shard?

MongoDB sharding is range based. So all the objects in a collection get put into a chunk. Only when there is more than 1 chunk is there an option for multiple shards to get data. Right now, the default chunk size is 64mb, so you need at least 64mb for a migration to occur.

23. What happens if I try to update a document on a chunk that is being migrated?

The update will go through immediately on the old shard, and then the change will be replicated to the new shard before ownership transfers.

24. What if a shard is down or slow and I do a query?

If a shard is down, the query will return an error unless the "Partial" query options is set. If a shard is responding slowly, mongos will wait for it.

25. Can I remove old files in the moveChunk directory?

Yes, these files are made as backups during normal shard balancing operations. Once the operations are done then they can be deleted. The cleanup process is currently manual so please do take care of this to free up space.

26. How can I see the connections used by mongos?

db._adminCommand("connPoolStats");

27. If a moveChunk fails do I need to cleanup the partially moved docs?

No, chunk moves are consistent and deterministic; the move will retry and when completed the data will only be on the new shard.

28. If I am using replication, can some members use journaling and others not?

Yes

Wednesday, January 8, 2014

CouchDB vs CouchBase: Differences and Similarities between CouchDB and CouchBase

CouchDB vs CouchBase: Differences and Similarities between CouchDB and CouchBase

Apache CouchDB and CouchBase, both are Free, Open Source, NoSQL document databases using JSON as their document format. Apache CouchDB and CouchBase have a lot of similarities and differences. Basically CouchBase is made up by combining the features of CouchDB and MemBase databases. CouchDB's document-oriented data model, indexing and querying capabilities are combined with MemBase's distributed key-value data model, high performance, easily scalable, always-on capabilities are combined to form CouchBase.

In short, we can say:

CouchBase = CouchDB + MemBase

But, CouchBase is not a new version of CouchDB but it is in fact a new version of MemBase. CouchBase Server is actually a new name for the MemBase Server. CouchBase is not a drop-in replacement for CouchDB but is a drop-in replacement for Memcached. CouchBase still uses the Memcached protocol and not the RESTful API of CouchDB. Meanwhile CouchDB is still CouchDB, actively maintained and enhanced as an Apache project. Also, CouchDB is not obsoleted by the CouchBase. CouchDB is an actively maintained open-source project and CouchBase is a completely separate project.

Similarities between CouchDB and CouchBase

1. CouchDB and CouchBase, both are NoSQL document databases using JSON as their document format.

2. Both CouchDB and CouchBase use the same approach to indexing and querying.

3. Both CouchDB and CouchBase use the same replication system except peer to peer replication.

Although CouchBase has been developed by combining the exciting features of CouchDB and MemBase, both CouchDB and CouchBase have a lot of differences with respect to their clustering, caching and licensing systems.

Differences between CouchDB and CouchBase

1. Clustering System

CouchBase has a built-in clustering system that allows data to be automatically spread across multiple nodes. Apache CouchDB, on the other hand, is a single node solution with peer-to-peer replication technology and is better suited for decentralized systems and for holding data amounts that do not need to be spread out across multiple nodes.

2. Caching System

CouchBase - like MemBase - has a built-in, Memcached-based caching technology that consistently provides sub millisecond read/write performance and is capable of hundreds of thousands of ops per second per node. Apache CouchDB is a disk-based database that usually is better suited for use cases where super low latency or high throughput is not a requirement.

3. Licensing System

The CouchBase is not entirely open-source/free software. There are two versions: Community Edition (free but no latest bug fixes) and Enterprise Edition (there are restrictions on usage, confidentiality provisions, audits by CouchBase Inc. and other terms typical to proprietary software that many people may find unacceptable).

CouchDB is an open-source/free software (no strings attached) project of The Apache Software Foundation and is released under the Apache License, Version 2.0.

4. Other Differences

CouchBase does not support following features of CouchDB:

A) RESTful API (only for views, not for CRUD operations)

B) Peer-to-peer replication

C) CouchApps

D) Futon (there is a different administration interface available)

E) Document IDs

F) Notion of databases (there are only buckets)

G) Replication between a CouchDB database and CouchBase Server

H) Explicit attachments (you have to store additional files as new key/value pairs)

I) HTTP API for everything is not available in CouchBase unlike CouchDB (you need to use the Couchbase Server SDKs or one of the Experimental Client Libraries at Couchbase Develop so no experiments with curl and wget)

J) CouchDB API (it uses the Memcached API instead)

K) You can't do everything from the browser in CouchBase unlike CouchDB (you have to write a server-side application)

L) No two-tier architecture for Web apps is possible in CouchBase unlike CouchDB(you have to write a server-side application to sit between the browser and the database, like with relational databases)

M) Leventual consistency

Saturday, January 4, 2014

12 Best Free and Open Source NoSQL Databases

12 Best Free and Open Source NoSQL Databases

NoSQL databases are becoming popular day by day. I have come up with the list of best, free and open source NoSQL databases. MongoDB tops the list of Open Source NoSQL databases. This list of free and open source databases comprises of MongoDB, Cassandra, CouchDB, Hypertable, Redis, Riak, Neo4j, HBASE, Couchbase, MemcacheDB, RevenDB and Voldemort. These free and open source NoSQL databases are really highly scale-able, flexible and good for big data storage and processing. These open source NoSQL databases are far ahead in terms of performance as compared to traditional relational databases. However, these may not be always the best choice for you. Most of common applications can still be developed using traditional relational databases. NoSQL databases are still not the best option for a mission critical transaction needs. I have listed down a small description of all these best free and oper source NoSQL databases. Lets have a look..

1. MongoDB

MongoDB is a document-oriented database that uses a JSON-style data format. It's ideal for website data storage, content management and caching applications, and can be configured for replication and high availability.

This highly scale-able and agile NoSQL database is a amazing performing system. This open source database written in C++ comes with a storage that is document oriented. Also, you will be provided with benefits like full index support, high availability across WANs and LANs along with easy replication, horizontal scaling, rich queries that are document based, flexibility in data processing and aggregation along with proper training, support and consultation.

2. Cassandra

An Apache Software Foundation project, Cassandra is a distributed database that allows for decentralized data storage that is fault tolerant and has no single point of failure. In other words, "Cassandra is suitable for applications that can't afford to lose data."

3. CouchDB

A product of the Apache Software Foundation, CouchDB is another document-oriented database that stores data in JSON format. It's ACID compliant, and like MongoDB, can be used to store data and content for websites, and to provide caching. You can use JavaScript to run MapReduce Queries on CouchDB. It also provides a very convenient web based administration console. This database could be really handy for web applications.

4. Hypertable

Modeled after Google's BigTable database system, Hypertable's creators aim for it to be the "open source standard for highly available, petabyte scale, database systems." In other words, Hypertable is designed for storing massive amounts of data reliably across many cheap servers.

5. Redis

This is an open source, key value store of an advanced level. Owing to the presence of hashes, sets, strings, sorted sets and lists in a key; Redis is also called as a data structure server. This system will help you in running atomic operations like incrementing value present in a hash, set intersection computation, string appending, difference and union. Redis makes use of in-memory dataset to achieve high performance. Also, this system is compatible with most of the programming languages.

6. Riak

This is one of the most powerful, distributed databases ever to be introduced. It provides for easy and predictable scaling and equips users with the ability for quick testing, prototyping and application deployment so as to simplify development.

7. Neo4j

This is a NoSQL graph database which exhibits a high level of performance. It comes well equipped with all the features of a robust and mature system. It provides the programmers with a flexible and object oriented network structure and allows them to enjoy all the benefits of a database that is fully transactional. Compared to RDBMS, Neo4j will also provide you with performance improvements on some of the applications.

8. Hadoop HBASE

HBase can be easily considered as a scalable, distributed and a big data store. This database can be used when you are looking for real time and random access to your data. It comes with modular and linear scalability along with reads and writes that are strictly consistent. Other features include Java API that has an easy client access, table sharding that is configurable and automatic, Bloom filters and block caches and much more.

9. Couchbase

While Couchbase was a fork of CouchDB, it has become more of a full-fledged data product and less of a ball of framework than CouchDB. Its transition to a document database will give MongoDB a run for its money. It is multithreaded per node, which can be a major scalability benefit -- especially when hosted on custom or bare-metal hardware. With some nice integration features, including with Hadoop, Couchbase is a great choice for an operational data store.

10. MemcacheDB

This is a distributed storage system of key value. It should not be confused with a cache solution; rather, it is a persistent storage engine which is meant for data storage and retrieval in a fast and reliable manner. Confirmation to memcache protocol is provided for. The storing backend that is used is the Berkeley DB which supports features like replication and transaction.

11. REVENDB

RAVENDB is a second generation open source DB. This DB is document oriented and schema free such as you simply have to dump in your objects into it. It provides extremely flexible and fast queries. This application makes scaling extremely easy by providing out-of-the-box support for replication, multi tenancy and sharding. There is full support for ACID transactions along with safety of your data. Easy extensibility via bundles is provided along with high performance.

12. Voldemort

This is an automatically replicating distributed storage system. It provides for automatic partitioning of data, transparent handling of server failure, pluggable serialization, independence of nodes and versioning of data items along with support for data distribution across various centers.

Friday, January 3, 2014

NoSQL vs RDBMS: Why and why not to use NoSQL over RDBMS?

NoSQL vs RDBMS: Why and why not to use NoSQL over RDBMS?

NoSQL (not only SQL) is not a relational database management system (RDBMS). We will discuss what is the difference between NoSQL databases and Relational Databases and then why and why not to use NoSQL database model over traditional and relational database model (RDBMS) in detail. As NoSQL is the new technology, it is also facing many challenges, so will also have a look upon them.

Today, the internet world has billions of users. Big Data, Big Users, and Cloud Computing are the big technologies which every major internet application is using or preparing to use because internet application users are growing day by day and data is becoming more and more complex and unstructured which is very hard to manage using traditional relational database management system (RDBMS). NoSQL technology has the answer to all these problems. NoSQL is meant for Unstructured Big Data and Cloud Computing. A NoSQL database is exactly the type of database that can handle the all sort of unstructured, messy and unpredictable data that our system of engagement requires. NoSQL is a whole new way of thinking about a database.

Difference between NoSQL and Relational Data Models (RDBMS)

Relational and NoSQL data models are very different.

The relational model takes data and separates it into many interrelated tables that contain rows and columns. Tables reference each other through foreign keys that are stored in columns as well. When looking up data, the desired information needs to be collected from many tables (often hundreds in today’s enterprise applications) and combined before it can be provided to the application. Similarly, when writing data, the write needs to be coordinated and performed on many tables.

NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object. Aggregating this information may lead to duplication of information, but since storage is no longer cost prohibitive, the resulting data model flexibility, ease of efficiently distributing the resulting documents and read and write performance improvements make it an easy trade-off for web-based applications.

Another major difference is that relational technologies have rigid schemas while NoSQL models are schemaless. Relational technology requires strict definition of a schema prior to storing any data into a database. Changing the schema once data is inserted is a big deal, extremely disruptive and frequently avoided – the exact opposite of the behavior desired in the Big Data era, where application developers need to constantly – and rapidly – incorporate new types of data to enrich their apps.

It also may not provide full ACID (atomicity, consistency, isolation, durability) guarantees, but still has a distributed and fault tolerant architecture.

The NoSQL taxonomy supports key-value stores, document store, BigTable, and graph databases.

In comparison, document databases are schemaless, allowing you to freely add fields to JSON documents without having to first define changes. The format of the data being inserted can be changed at any time, without application disruption.

Examples: MongoDB, Cassandra, CouchDB, HBase are the examples of NoSQL.

NoSQL Database Types

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.

Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.

Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

Why to use NoSQL Databases?

1. NoSQL has Flexible Data Model to Capture Unstructured / Semi-structured Big Data

Data is becoming easier to capture and access through third parties such as Facebook, D&B, and others. Personal user information, geo location data, social graphs, user-generated content, machine logging data, and sensor-generated data are just a few examples of the ever-expanding array of data being captured. It’s not surprising that developers want to enrich existing applications and create new ones made possible by it. And the use of the data is rapidly changing the nature of communication, shopping, advertising, entertainment, and relationship management. Apps that don’t leverage it quickly will quickly fall behind.

Developers want a very flexible database that easily accommodates new data types and isn’t disrupted by content structure changes from third-party data providers. Much of the new data is unstructured and semi-structured, so developers also need a database that is capable of efficiently storing it. Unfortunately, the rigidly defined, schema-based approach used by relational databases makes it impossible to quickly incorporate new types of data, and is a poor fit for unstructured and semi-structured data. NoSQL provides a data model that maps better to these needs.

A lot of applications might gain from this unstructured data model: tools like CRM, ERP, BPM, etc, could use this flexibility to store their data without performing changes on tables or creating generic columns in a database. These databases are also good to create prototypes or fast applications, because this flexibility provides a tool to develop new features very easily.

2. NoSQL is highly and easily scalable (Scale up vs Scale out)

If millions of users are using your app frequently and concurrently, you need to think about the scalable database technology instead of traditional RDBMS. With relational technologies, many application developers find it difficult, or even impossible, to get the dynamic scalability and level of scale they need while also maintaining the performance users demand. You need to switch to NoSQL databases.

For the cloud applications, relational databases were originally the popular choice. Their use was increasingly problematic however, because they are a centralized, share-everything technology that scales up rather than out. This made them a poor fit for applications that require easy and dynamic scalability. NoSQL databases have been built from the ground up to be distributed, scale-out technologies and therefore fit better with the highly distributed nature of the three-tier Internet architecture.

Scale up vs Scale out

To deal with the increase in concurrent users (Big Users) and the amount of data (Big Data), applications and their underlying databases need to scale using one of two choices: scale up or scale out. Scaling up implies a centralized approach that relies on bigger and bigger servers. Scaling out implies a distributed approach that leverages many standard, commodity physical or virtual servers.

Scale up with relational technology: limitations at the database tier

At the web/application tier of the three-tier Internet architecture, a scale out approach has been the default for many years and worked extremely well. As more people use an application, more commodity servers are added to the web/application tier, performance is maintained by distributing load across an increased number of servers, and the cost scales linearly with the number of users.

Prior to NoSQL databases, the default scaling approach at the database tier was to scale up. This was dictated by the fundamentally centralized, shared-everything architecture of relational database technology. To support more concurrent users and/or store more data, you need a bigger and bigger server with more CPUs, more memory, and more disk storage to keep all the tables. Big servers tend to be highly complex, proprietary, and disproportionately expensive, unlike the low-cost, commodity hardware typically used so effectively at the web/application server tier.

Scale out with NoSQL technology at the database tier

NoSQL databases were developed from the ground up to be distributed, scale out databases. They use a cluster of standard, physical or virtual servers to store data and support database operations. To scale, additional servers are joined to the cluster and the data and database operations are spread across the larger cluster. Since commodity servers are expected to fail from time-to-time, NoSQL databases are built to tolerate and recover from such failure making them highly resilient.

NoSQL databases provide a much easier, linear approach to database scaling. If 10,000 new users start using your application, simply add another database server to your cluster. Add ten thousand more users and add another server. There’s no need to modify the application as you scale since the application always sees a single (distributed) database.

At scale, a distributed scale out approach also usually ends up being cheaper than the scale up alternative. This is a consequence of large, complex, fault tolerant servers being expensive to design, build and support. Licensing costs of commercial relational databases can also be prohibitive because they are priced with a single server in mind. NoSQL databases on the other hand are generally open source, priced to operate on a cluster of servers, and relatively inexpensive.

While implementations differ, NoSQL databases share some characteristics with respect to scaling and performance:

DYNAMIC SCHEMAS

Relational databases require that schemas be defined before you can add data. For example, you might want to store data about your customers such as phone numbers, first and last name, address, city and state – a SQL database needs to know what you are storing in advance.

This fits poorly with agile development approaches, because each time you complete new features, the schema of your database often needs to change. So if you decide, a few iterations into development, that you'd like to store customers' favorite items in addition to their addresses and phone numbers, you'll need to add that column to the database, and then migrate the entire database to the new schema.

If the database is large, this is a very slow process that involves significant downtime. If you are frequently changing the data your application stores – because you are iterating rapidly – this downtime may also be frequent. There's also no way, using a relational database, to effectively address data that's completely unstructured or unknown in advance.

NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real-time, without worrying about service interruptions – which means development is faster, code integration is more reliable, and less database administrator time is needed.

AUTO-SHARDING

Because of the way they are structured, relational databases usually scale vertically – a single server has to host the entire database to ensure reliability and continuous availability of data. This gets expensive quickly, places limits on scale, and creates a relatively small number of failure points for database infrastructure. The solution is to scale horizontally, by adding servers instead of concentrating more capacity in a single server.

"Sharding" a database across many server instances can be achieved with SQL databases, but usually is accomplished through SANs and other complex arrangements for making hardware act as a single server. Because the database does not provide this ability natively, development teams take on the work of deploying multiple relational databases across a number of machines. Data is stored in each database instance autonomously. Application code is developed to distribute the data, distribute queries, and aggregate the results of data across all of the database instances. Additional code must be developed to handle resource failures, to perform joins across the different databases, for data rebalancing, replication, and other requirements. Furthermore, many benefits of the relational database, such as transactional integrity, are compromised or eliminated when employing manual sharding.

NoSQL databases, on the other hand, usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.

Cloud computing makes this significantly easier, with providers such as Amazon Web Services providing virtually unlimited capacity on demand, and taking care of all the necessary database administration tasks. Developers no longer need to construct complex, expensive platforms to support their applications, and can concentrate on writing application code. Commodity servers can provide the same processing and storage capabilities as a single high-end server for a fraction of the price.

“Sharding” a relational database can reduce, or eliminate in certain cases, the ability to perform complex data queries. NoSQL database systems retain their full query expressive power even when distributed across hundreds of servers.

INTEGRATED CACHING

A number of products provide a caching tier for SQL database systems. These systems can improve read performance substantially, but they do not improve write performance, and they add complexity to system deployments. If your application is dominated by reads then a distributed cache should probably be considered, but if your application is dominated by writes or if you have a relatively even mix of reads and writes, then a distributed cache may not improve the overall experience of your end users.

Many NoSQL database technologies have excellent integrated caching capabilities, keeping frequently-used data in system memory as much as possible and removing the need for a separate caching layer that must be maintained.

REPLICATION

Most NoSQL databases also support automatic replication, meaning that you get high availability and disaster recovery without involving separate applications to manage these tasks. The storage environment is essentially virtualized from the developer's perspective.

Challenges of NoSQL

The promise of the NoSQL database has generated a lot of enthusiasm, but there are many obstacles to overcome before they can appeal to mainstream enterprises. Here are a few of the top challenges.

1. Maturity

RDBMS systems have been around for a long time. NoSQL advocates will argue that their advancing age is a sign of their obsolescence, but for most CIOs, the maturity of the RDBMS is reassuring. For the most part, RDBMS systems are stable and richly functional. In comparison, most NoSQL alternatives are in pre-production versions with many key features yet to be implemented.

Living on the technological leading edge is an exciting prospect for many developers, but enterprises should approach it with extreme caution.

2. Support

Enterprises want the reassurance that if a key system fails, they will be able to get timely and competent support. All RDBMS vendors go to great lengths to provide a high level of enterprise support.

In contrast, most NoSQL systems are open source projects, and although there are usually one or more firms offering support for each NoSQL database, these companies often are small start-ups without the global reach, support resources, or credibility of an Oracle, Microsoft, or IBM.

3. Analytics and business intelligence

NoSQL databases have evolved to meet the scaling demands of modern Web 2.0 applications. Consequently, most of their feature set is oriented toward the demands of these applications. However, data in an application has value to the business that goes beyond the insert-read-update-delete cycle of a typical Web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies.

NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL.

Some relief is provided by the emergence of solutions such as HIVE or PIG, which can provide easier access to data held in Hadoop clusters and perhaps eventually, other NoSQL databases. Quest Software has developed a product -- Toad for Cloud Databases -- that can provide ad-hoc query capabilities to a variety of NoSQL databases.

4. Administration

The design goals for NoSQL may be to provide a zero-admin solution, but the current reality falls well short of that goal. NoSQL today requires a lot of skill to install and a lot of effort to maintain.

5. Expertise

There are literally millions of developers throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in a learning mode. This situation will address naturally over time, but for now, it's far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.

Conclusion

NoSQL databases are becoming an increasingly important part of the database landscape, and when used appropriately, can offer real benefits. However, enterprises should proceed with caution with full awareness of the legitimate limitations and issues that are associated with these databases.