Friday, September 6, 2013

Document Management System for Liferay Portal in Clustered Environment

Objective:

                We want to scale Liferay to manage, in a hierarchical manner, the documents uploaded through the Documents and Media Library, in a setup where Liferay runs in cluster mode and the data stays as easy to maintain as possible.

Different Ways to achieve DMS in Liferay:

                Liferay introduces a new Documents and Media Library that can mount several repositories at a time and present a unified interface to the user. By default, users can make use of the Liferay repository, which is already mounted. This repository is built into Liferay Portal and can use one of several different store implementations as its back end. In addition, many different kinds of third-party repositories can be mounted. If you have mounted a separate repository, all nodes of the cluster will point to this repository. Your avenue for improving performance at this point is to cluster your third-party repository, using the documentation of the repository you have chosen. If you don’t have a third-party repository, there are ways to configure the Liferay repository to perform well in a clustered configuration.
                The main thing to keep in mind is that every node of the cluster needs the same access to the file store as every other node. For this reason, you’ll need to take a look at your store configuration. Listed below are the stores that Liferay supports; the pros and cons of each follow.
·         File System Store
·         Advanced File System Store
·         CMIS Store
·         S3 Store (Amazon Simple Storage service)
·         Documentum Store
·         JCR Store
Now we will look at each store in detail, along with how to configure it with Liferay.

·         File System Store:


This is the default storage mechanism Liferay uses to store documents. It’s a simple file storage implementation that uses a local folder to store files. You can use the file system store in a clustered environment, but you have to make sure that the folder the store points to can handle things like concurrent requests and file locking. For this reason, you need a Storage Area Network (SAN) or a clustered file system.
Features:
·         The file system store was the first store created by Liferay and is heavily bound to the Liferay database.
·         This store creates a folder structure based on primary keys in the Liferay database.
Pros:
·         This store binds your documents very closely to Liferay, which may not be exactly what you want. However, if you have been using the default settings for a while and need to migrate your documents, Liferay provides a migration utility in the Control Panel under Server Administration -> Data Migration. Using this utility, you can move your documents easily from one store implementation to another.
Cons:
·         The file system store is limited by the storage capacity of the local operating system’s file system. A large document library can put too many files into a single folder or exhaust the space available on the local file system.
·         It is not a good match for Liferay in cluster mode, because this approach requires a SAN or a clustered file system, which is expensive to manage.
·         It is not a good fit when concurrent users write to the same file. The file system store does not lock files or synchronize their contents by itself; for this it depends on a third party, namely the underlying file system.
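To make the file-locking requirement concrete, here is a minimal sketch in plain Java (JDK only) of a writer that holds an exclusive lock while it replaces a file's content. This is not Liferay's store code; the class and method names are made up for illustration. Whether such a lock is honored across cluster nodes depends on the operating system and on the shared file system, which is exactly why the choice of SAN or clustered file system matters.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only (not part of Liferay).
class LockedWrite {

    // Replaces the file's content while holding an exclusive lock on it.
    // Another process calling lock() on the same file blocks until this lock is released.
    static void writeExclusively(Path file, String content) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            channel.truncate(0); // drop the old content before writing the new one
            ByteBuffer buffer = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // flush to the storage device before unlocking
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("dl-store", ".txt");
        writeExclusively(file, "version 1");
        writeExclusively(file, "v2");
        System.out.println(Files.readString(file)); // v2
        Files.delete(file);
    }
}
```

If the shared folder sits on a file system that does not enforce such locks between machines, two nodes can still overwrite each other's data, which is the corruption risk described above.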

·         Advanced File System Store:


The Advanced File System Store is similar to the default File System Store. Like that store, it saves files to the local file system, which of course could be a remote file system mount. It uses a slightly different folder structure to store the files.
Pros:
·         Several operating systems have limitations on the number of files which can be stored in a particular folder. The advanced file system store overcomes this limitation by programmatically creating a structure that can expand to millions of files, by alphabetically nesting the files in folders. This not only allows more files to be stored, but also improves performance, as fewer files are stored per folder.
Cons:
·         The same rule applies to the advanced file system store as to the default file system store. To cluster it, you’ll need to point the store to a network-mounted file system that all the nodes can access, and that networked file system needs to support concurrent requests and file locking. Otherwise you may experience data corruption if two users on two different nodes attempt to write to the same file at the same time.
·         If the advanced file system store doesn’t serve your needs, you have to look at the other options.
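The idea of nesting files in folders so that no single folder grows too large can be sketched as follows. This is only an illustration of the technique under my own assumptions (fixed-length name segments, a hypothetical `ShardedPath` class); it is not the exact layout that Liferay's Advanced File System Store produces.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only (not Liferay's implementation).
class ShardedPath {

    // Builds root/seg1/seg2/.../fileKey, where each segment is a fixed-length
    // slice of the key, so files are spread across many small folders.
    static Path shard(Path root, String fileKey, int segmentLength) {
        if (segmentLength <= 0) {
            throw new IllegalArgumentException("segmentLength must be positive");
        }
        Path dir = root;
        int i = 0;
        // Every full segment that still has characters after it becomes a folder level.
        while (fileKey.length() - i > segmentLength) {
            dir = dir.resolve(fileKey.substring(i, i + segmentLength));
            i += segmentLength;
        }
        return dir.resolve(fileKey);
    }

    public static void main(String[] args) {
        // With 2-character segments, key 1234567 lands under the folders 12, 34 and 56.
        System.out.println(shard(Paths.get("data", "document_library"), "1234567", 2));
    }
}
```

With two-character segments and numeric keys, each folder holds at most 100 sub-folders, so lookups stay fast even when the store holds millions of files.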

·         CMIS Store:


CMIS (Content Management Interoperability Services) is an open standard that allows different content management systems to interoperate over the Internet. Specifically, CMIS defines an abstraction layer for controlling diverse document management systems and repositories using web protocols.
Features:
·         CMIS defines a domain model plus Web Services and RESTful AtomPub bindings that can be used by applications.
·         CMIS provides a common data model covering typed files and folders with generic properties that can be read.
·         There is a set of services for adding and retrieving documents.
·         There may be an access control system, a checkout and version control facility, and the ability to define generic relations.
·         You can communicate with a CMIS repository over two bindings: SOAP (Web Services described by WSDL) and REST (using the AtomPub conventions).
·         The model is based on the common architecture of document management systems.

Pros:
·         The CMIS specification provides a web services interface that is programming-language agnostic (REST and SOAP are implemented in many languages).
·         The CMIS specification provides a web services interface that decouples the web service from the content, so CMIS can be used to access a historic document repository.
·         A Liferay administrator can mount a clustered CMIS repository through the UI.
·         It is the best fit for a clustered environment: as long as all Liferay nodes point to your CMIS repository, everything in your Liferay cluster should be fine, as the CMIS protocol prevents multiple simultaneous file accesses from causing data corruption.

·         S3 Store (Amazon Simple Storage service):


Amazon S3 is a cloud-based storage solution that you can use with Liferay. Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
Features:
·         Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited.
·         Each object is stored in a bucket and retrieved via a unique, developer-assigned key.
·         A bucket can be stored in one of several Regions. You can choose a Region to optimize for latency, minimize costs, or address regulatory requirements. Objects stored in a Region never leave the Region unless you transfer them out.
·         Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.
·         Options for secure data upload/download and encryption of data at rest are provided for additional data protection.
·         Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
·         Built to be flexible so that protocol or functional layers can easily be added. The default download protocol is HTTP. A BitTorrent protocol interface is provided to lower costs for high-scale distribution.
·         Provides functionality to simplify manageability of data through its lifetime. Includes options for segregating data by buckets, monitoring and controlling spend and automatically archiving data to even lower cost storage options.
Pros:
·         The main advantage is that it is easy to set up with Liferay. When you sign up for the service, Amazon assigns you some unique keys which link you to your account. In Amazon’s interface, you can create “buckets” of data optimized by region. Once you’ve created these to your specifications, all you need to do is declare them in portal-ext.properties:
        dl.store.s3.access.key=
        dl.store.s3.secret.key=
        dl.store.s3.bucket.name=
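Since the three properties above are left empty in the listing, a small sanity check before starting the nodes can save some debugging. The following is a sketch of such a check in plain Java; the class name and the sample values are made up for illustration, and only the three property keys come from the listing above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of Liferay).
class S3StoreConfigCheck {

    // The three keys shown in the portal-ext.properties listing above.
    static final String[] REQUIRED_KEYS = {
        "dl.store.s3.access.key",
        "dl.store.s3.secret.key",
        "dl.store.s3.bucket.name"
    };

    // Returns the required keys that are absent or have a blank value.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Sample content with placeholder values; the secret key is left blank on purpose.
        String sample = "dl.store.s3.access.key=EXAMPLE_ACCESS_KEY\n"
                + "dl.store.s3.secret.key=\n"
                + "dl.store.s3.bucket.name=example-bucket\n";
        Properties props = new Properties();
        props.load(new StringReader(sample));
        System.out.println(missingKeys(props)); // [dl.store.s3.secret.key]
    }
}
```

In a cluster, every node needs the same three values, so running a check like this against each node's portal-ext.properties is a cheap way to catch a node that was missed.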
Cons:
·         It is a proprietary, paid service, so the cost of maintaining a DMS on it is high.

·         Documentum Store:


If you have a Liferay Portal EE license, you have access to the Documentum hook, which adds support for Documentum to Liferay’s Documents and Media Library. You have to install it from Liferay Marketplace.
This hook doesn’t add an option to make the Liferay repository into a Documentum repository, as the other store implementations do. Instead, it gives you the ability to mount Documentum repositories via the Documents and Media Library UI.
There’s not really a lot to this; it’s incredibly easy. Click Add → Repository, and in the form that appears, choose Documentum as the repository type. After that, give it a name and specify the Documentum repository and cabinet, and Liferay mounts the repository for you. That’s really all there is to it.
If all your nodes are pointing to a Documentum repository, you can cluster Documentum to achieve higher performance.
More information is available here: http://www.liferay.com/marketplace/-/mp/application/15098914  

·         JCR Store:


Liferay is a Content Management System (CMS) that is rich in features, flexible, and easy to learn. Content management vendors have traditionally each exposed their own proprietary API. The Java Community Process developed a solution to this trend: JSR-170 and JSR-283, also known as the Java Content Repository (JCR) API.
The JCR specification provides a unified interface that different vendors can implement to meet the needs of content management systems. Application developers, in turn, are saved from learning different proprietary APIs, which reduces time-to-market. They just need to learn one API that is compatible with any JSR 170/283 compliant repository. This framework is not only vendor neutral; it is also not tied to any particular underlying architecture. The back-end data storage could be a file system, a WebDAV repository, an XML-backed system, or an SQL-based database.
In addition to flexibility, the Java Content Repository is like a fusion of a database and a file system. Among the valuable features of this integration are:
§  Support for both structured and unstructured content.
§  Hierarchical design
§  SQL and/or XPath Query
§  Access control
§  Locking
§  Versioning
§  Full-text search
Many JCR-compliant repositories are already available on the market.
Liferay supports the JCR standard as a store. Using the default settings, the JCR Store is not very different from the file system stores, except that you can use a JCR client to access the files. Any JCR implementation can be used for this. Here we consider the following two:
·         Jackrabbit
·         JBoss ModeShape

Let’s look at these two implementations in detail.

Jackrabbit:
·         Apache Jackrabbit is a complete, fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and JSR 283).
·         A content repository is a hierarchical content store with support for structured and unstructured content, full-text search, versioning, transactions, observation, and more.
·         In Liferay, Jackrabbit is the default implementation behind the JCR Store.
·         JSR 170 explicitly allows for numerous different deployment models, meaning that it is entirely up to the repository implementation to suggest certain models.
·         Jackrabbit is built to support a variety of deployment models. Two of them are relevant here:
§  Embedded mode
§  Standalone mode
·         In Liferay, Jackrabbit is available in embedded mode by default. This means we only have to enable some properties in the portal-ext.properties file and configure Jackrabbit's repository.xml file to enable Jackrabbit in Liferay.
·         In embedded mode, Jackrabbit is initialized automatically when Liferay starts and destroyed when Liferay stops. In standalone mode, we have to start the Jackrabbit server ourselves and set everything up manually.
·         Embedded mode works fine when we run just one Liferay instance, but when Liferay is in cluster mode, Jackrabbit has to be kept as a shared repository. For this, change the following property in the portal-ext.properties file:

jcr.jackrabbit.repository.root=${liferay.home}/data/jackrabbit

Change this property to point to a shared folder that all the Liferay nodes can see. If Jackrabbit is kept local to each node, the data on the nodes quickly falls out of sync, so all Liferay nodes have to share one repository. To do this, create a new configuration file (repository.xml) at the shared location, then update every Liferay node to point the repository at this new location.
·         This configuration has a major drawback: because of file locking issues, it isn’t the best way to share Jackrabbit resources unless you are using a networked file system that can handle concurrency and file locking. If two users are logged in at the same time and try to upload content, you could encounter data corruption. Because of this, we don’t recommend this configuration.
·         If Liferay is in cluster mode, the recommended way to use JCR in the cluster is to redirect Jackrabbit to a database. You can use Liferay’s database or any other database of your choice. To do so, change the configuration file (repository.xml) to point to the database.
·         In cluster mode, every node has its own Jackrabbit configuration and its own repository.xml file. Make the same change in that file on every node and point it to the database of your choice.
·         Once you have configured Jackrabbit to store its repository in a database, the necessary database tables are created automatically the next time you bring up Liferay. Jackrabbit does not create indexes on these tables by itself, so you have to index the primary keys of the tables manually or write your own logic to create the indexes automatically.
·         Storing the documents in a database is a better approach than a file system store such as the Advanced File System Store, because you also get the benefit of clustering.
·         One major advantage is that when you upgrade Liferay from a lower version to a higher one, Liferay supports upgrading your Jackrabbit configuration as well, because the Jackrabbit integration is provided by Liferay itself.
               


JBoss ModeShape:
·         ModeShape is a distributed, hierarchical, transactional, and consistent data store with support for queries, full-text search, events, versioning, references, and flexible and dynamic schemas. It is very fast, highly available, extremely scalable, and it is 100% open source and written in Java.
·         ModeShape is perfect for data that is organized in a tree-like hierarchical structure where related data is stored close together, where navigation to related content is just as common and important as fast key based lookups or queries. The hierarchical organization is similar to a file system, making ModeShape a natural for storing files annotated with metadata. ModeShape can even automatically extract the structured information within the files so that clients can navigate or use typed queries to find files satisfying complex, structurally-oriented criteria. ModeShape is an excellent store for data with a complex schema, since the schema can vary over the database and evolve over time. ModeShape is the perfect distributed data store for all kinds of applications, including repositories, content management systems, historical data services, provisioning and governance systems, and metadata management systems.
·         ModeShape supports all JCR 2.0 required features:
o   Repository acquisition
o   Authentication
o   Reading/navigating
o   Query
o   Export
o   Node type discovery
o   Permissions and capability checking.
·         And most of the JCR 2.0 optional features:
o   Writing
o   Import
o   Observation
o   Workspace management
o   Versioning
o   Locking
o   Node type management
o   Same-name siblings
o   Orderable child nodes
o   Shareable nodes
·         ModeShape is an open source implementation of the JCR 2.0 API and thus behaves like a regular JCR repository. Applications can search, query, navigate, change, version, listen for changes, etc. ModeShape can store the content in a variety of back-end stores (including relational databases, Infinispan data grids, JBoss Cache, etc.), or it can access and update existing content from “other” kinds of systems (including file systems, SVN repositories, JDBC database metadata, and other JCR repositories).
·         In ModeShape, data is organized as a tree of JCR nodes. Each JCR node contains the following elements:
o   Name, path, and identifier
o   Properties (names and values)
o   Child nodes
o   One or more node types
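The node structure listed above can be pictured with a small in-memory model. The sketch below is plain Java and deliberately does not use the JCR API or ModeShape; the `ContentNode` class and the `example:title` property are hypothetical, and only the four elements (name/path/identifier, properties, children, node types) are taken from the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory model of a JCR-style node (not the javax.jcr API).
class ContentNode {
    final String name;
    final String identifier = UUID.randomUUID().toString();
    final List<String> nodeTypes;                         // one or more node types
    final Map<String, Object> properties = new LinkedHashMap<>();
    final List<ContentNode> children = new ArrayList<>();
    private ContentNode parent;                           // null for the root node

    ContentNode(String name, String... nodeTypes) {
        this.name = name;
        this.nodeTypes = Arrays.asList(nodeTypes);
    }

    // Creates a child node, attaches it to this node, and returns it.
    ContentNode addChild(String childName, String... childTypes) {
        ContentNode child = new ContentNode(childName, childTypes);
        child.parent = this;
        children.add(child);
        return child;
    }

    // The path is derived by walking up to the root, as in a file system.
    String path() {
        if (parent == null) {
            return "/";
        }
        String parentPath = parent.path();
        return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
    }

    public static void main(String[] args) {
        ContentNode root = new ContentNode("", "nt:unstructured");
        ContentNode folder = root.addChild("documents", "nt:folder");
        ContentNode file = folder.addChild("report.pdf", "nt:file", "mix:versionable");
        file.properties.put("example:title", "Quarterly report");
        System.out.println(file.path()); // /documents/report.pdf
    }
}
```

A real repository adds persistence, sessions, and type validation on top of this shape, but the tree of named, typed nodes with properties is what an application navigates.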
·         Features of ModeShape:
o   All data is organized in a hierarchical tree-like structure of nodes, single- and multi-valued properties, and children.
o   All data is cached and stored in Infinispan, which can persist data on the file system, in databases, in the cloud, and even distributed in-memory across a data grid
o   Cluster to distribute/replicate data across multiple machines, and even keep most/all of it in-memory to form a data-grid with extremely fast access (faster than from local disk)
o   Implements the JSR-283 standard Java API for content repositories (aka, JCR 2.0)
o   Define a schema with node types and mixins that (optionally) limit the properties and children for various kinds of nodes, and evolve the schema over time without having to migrate the data.
o   Use multiple query languages, including SQL-like, XPath, and full-text search languages to find data.
o   Use sessions to create and validate large amounts of content transiently, and then save all changes with one call.
o   ModeShape can be configured to use and participate in JTA transactions.
o   Register to be notified with events when data is changed anywhere in the cluster, optionally filtered by custom criteria.
o   Segregate data into multiple repositories and workspaces.
o   Embed ModeShape into your Java SE, EE, or web applications.
o   Install into JBoss AS7 and applications to centrally configure, manage, and monitor repositories.
·         ModeShape can be embedded into your standalone and web applications, or installed and run as a service in JBoss AS 5.x or 6.x.
·         ModeShape is a JCR 2.0 implementation that supports all of JCR 2.0 required features: repository acquisition, authentication, reading/navigating, query, export, node type discovery, and permissions and capability checking.
·         ModeShape also implements most of the optional JCR 2.0 features: writing, import, observation, workspace management, versioning, locking, node type management, same-name siblings, shareable nodes, and orderable child nodes.
·         ModeShape in a Java EE application:
o   So far we have covered the functionality and features of ModeShape. Now let's look at how to use it in a Java EE application and in Liferay Portal.
o   ModeShape makes it easy to use JCR repositories within web and Java EE applications deployed to virtually any web or application server.
o   ModeShape is small and lightweight enough that you can easily embed it into your own Java SE applications. The only thing you need to decide is how much control and management your application needs over the ModeShape repositories. If your application just needs to look up and use one or more JCR Repository instances, it can use the JCR API. If it needs more control over dynamically deploying, monitoring, reconfiguring, and undeploying individual repositories, it can use the ModeShape-specific API.
o   For the most part, the best way to use ModeShape within a web application deployed to Tomcat, GlassFish, or another container or application server is simply to embed it into your web application. At that point, it is very similar to using ModeShape in a Java SE application.
o   If you have several web apps that share the same ModeShape repositories, embedding ModeShape into each and using the same configuration files should work, or you could create a single web app to manage the ModeShape repositories.
·         ModeShape in a Liferay application:
o   In Liferay, we mainly need to integrate ModeShape to manage the DMS (Document Management System), that is, the documents uploaded through the Documents and Media Library.
o   There are two ways to integrate ModeShape with Liferay. One is to embed ModeShape by overriding the basic JCR implementation code (JCR API) that is available in Liferay. The other is to run ModeShape in standalone mode and communicate with it over protocols such as CMIS, the REST API, or WebDAV.
o   If you are running a single Liferay instance without a cluster, it is advisable to use ModeShape in embedded mode. It is also possible to integrate ModeShape with Tomcat; see the ModeShape documentation for more information.
o   When Liferay is in cluster mode, we have to consider the following points:
§  We can embed ModeShape inside the Liferay portal, but then data synchronization between the nodes becomes a big problem.
§  Embedded ModeShape is still clustered in the same way (via JGroups), and the data is stored in Infinispan (which also needs to be clustered).
§  ModeShape uses transactions, so concurrent writes are possible without global write locks (which are a limitation of Jackrabbit). This relies on Infinispan's support for and use of transactions.
§  ModeShape also serializes conflicting writes: if multiple JCR sessions (in the same process or spread across your cluster) regularly update the same node and save at the same time, all of these changes are serialized, and one of them blocks the others. Most of the time, however, the sessions update different nodes, in which case there is no blocking.
§  There are several ways to use ModeShape:
·          If you want to use the JCR API in your application, you need to run ModeShape within the same process(es). If you are running regular Java SE applications, your application instantiates and starts the ModeShapeEngine. If your applications are deployed to a web server (e.g., Tomcat), you can either embed ModeShape inside your single web app, or have the web server run ModeShape (e.g., in Tomcat via the "server.xml" file) and have your application(s) look up the repositories in JNDI. See the ModeShape documentation for how ModeShape can be clustered.
·         If you want to use the REST API or WebDAV, your application simply accesses the server using ModeShape's REST API or the WebDAV protocol. If you’re going to use CMIS, your application uses the CMIS REST API or a CMIS client application framework. All of these remote protocols are less efficient than using the JCR API from within the same process, simply because they require network communication. ModeShape's implementations of these network protocols were built for basic purposes only, so they do not handle complex scenarios such as heavy writes or multiple users writing at the same time.
§  ModeShape has to be configured in JSON format; XML configuration is not supported yet.
§  ModeShape internally uses Infinispan to store its back-end data.
§  To cluster Liferay with ModeShape as the DMS, we have to provide the ModeShape JSON configuration on every Liferay node. We can also cluster ModeShape itself based on our requirements: ModeShape uses Infinispan, and Infinispan supports clustering at different levels, so ModeShape can be clustered through Infinispan.
o   ModeShape can also access external and internal data in exactly the same way, as if it were all stored in one place. This is called federation.
o   Currently, ModeShape depends on Infinispan and JGroups.
o   ModeShape also provides a number of connectors out of the box. These are ready to be used by simply including them in the classpath and configuring them as a repository source.
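The point about concurrent writes (saves to the same node are serialized, saves to different nodes do not block each other) can be illustrated with a per-node lock. The sketch below is plain Java and only mimics the behaviour inside one process; it is not how ModeShape implements it (ModeShape relies on Infinispan transactions), and the `PerNodeWrites` class is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of per-node write serialization (not ModeShape code).
class PerNodeWrites {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<String, Integer> versions = new ConcurrentHashMap<>();

    // Saves to the same node path run one at a time;
    // saves to different paths use different locks and do not block each other.
    int save(String nodePath) {
        ReentrantLock lock = locks.computeIfAbsent(nodePath, p -> new ReentrantLock());
        lock.lock();
        try {
            int next = versions.getOrDefault(nodePath, 0) + 1;
            versions.put(nodePath, next);
            return next;
        } finally {
            lock.unlock();
        }
    }

    int version(String nodePath) {
        return versions.getOrDefault(nodePath, 0);
    }

    public static void main(String[] args) throws InterruptedException {
        PerNodeWrites store = new PerNodeWrites();
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                store.save("/documents/report.pdf");
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();
        // No update is lost, because the two writers never modify the node at the same time.
        System.out.println(store.version("/documents/report.pdf")); // 2000
    }
}
```

A global write lock, by contrast, would make a writer on one document wait for a writer on a completely unrelated document, which is the limitation attributed to Jackrabbit above.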
§  JPA Connector (Java Persistence API)
§  JCR Connector (Java Content Repository)

·         Summary:


This document is mainly about maintaining Liferay’s DMS (Document Management System). Several stores are available by default, and we can choose among them based on our requirements. For example, the File System Store is the default and is what we normally use.
From a scalability perspective, there are several other factors to consider, such as clustering, synchronization, versioning, concurrent writes, performance, authentication, permissions, and export mechanisms.
To improve performance, we cluster the Liferay nodes. Because multiple Liferay nodes then communicate with the DMS in different ways, we have to think about which DMS to use and how to configure it.
·         When Liferay is in cluster mode, the File System and Advanced File System stores are not recommended, because a SAN is costly.
·         If your application is already integrated with an ECM, I would suggest using the CMIS Store to communicate with it. Liferay provides easy steps for using CMIS to connect to Alfresco ECM.
·         If the cost of a proprietary product is not an issue for you, I would suggest the S3 Store, which keeps your data in the cloud.
·         Documentum support is provided only through a Liferay EE plugin. It is an option, but it is also a proprietary product and has limitations in terms of scalability.
·         If you want to save cost and easily integrate a DMS with your application in cluster mode, I would suggest using Jackrabbit (a JCR implementation) to store the documents in a database instead of the file system, because it performs well in a cluster. There are only two pitfalls: the database tables have to be indexed manually, and concurrent writes are not supported in cluster mode. One advantage is that when Liferay is upgraded to a newer version, Liferay supports upgrading Jackrabbit as well.

·         We can use ModeShape, another JCR implementation, as the DMS. It compares well with Jackrabbit, but Liferay itself does not provide it, so we cannot count on Liferay's support during future upgrades. When the application is in cluster mode, ModeShape can run in cluster mode too: it provides all of the required JCR 2.0 features and most of the optional ones, and it also supports concurrent writes, which Jackrabbit does not. To use the JCR API, which is more powerful than the other APIs, ModeShape has to be embedded in your own web server cluster. If you run ModeShape on a separate cluster of servers, you have to use the REST API, WebDAV, or CMIS, and none of these is suitable, because ModeShape does not expose all of its functionality through them. If we still have to run ModeShape on a separate cluster of servers, we have to implement our own REST service on top of ModeShape and deploy it with the ModeShape cluster. The benefit is that you can size the ModeShape cluster for throughput/load separately from your application; the disadvantage is the additional network overhead.
