1 Alfresco Two-Way Sync with Apache Camel. Peter Lesty, Technical Director, Parashift. Hello everyone, my name's Peter Lesty, and I am the technical director and also a part owner of Parashift, an Australian-based Alfresco partner. One of the questions we get asked quite a lot by prospective and existing clients is: what can Alfresco integrate with, and how do you go about it? Over the course of the last 4 years since Parashift's inception, it's been my job to answer that question: How do you integrate with Alfresco? What systems can it talk to? How would you go about integrating with this legacy system? Now, depending on the type of integration, we have normally provided a different answer, but it's always a "yes we can!" Today I'd like to focus a bit closer on a particular type of integration: synchronisation of content between two systems. Even more specifically, I'd like to focus on two-way synchronisation with Alfresco.
2 The Problem: Synchronisation Between Alfresco and External Systems. So the problem we're faced with is how do we get Alfresco talking, and what tools are available out there to assist with this. Before that, I'd like to go into a few real-world scenarios that we've had to find solutions for. These scenarios are from our existing clients and existing projects, and vary quite a bit in their problem scope. My challenge as technical director was to find something we could use to address these issues, and unknown ones in the future, and reuse as much as we can, rather than reinventing the wheel each time.
3 Alfresco Two-Way Synchronisation: Sync a selection of Nodes between Instances. Not Limited to Folders and Files, should include Data Lists, Wikis and Forums. Should Sync Document Locks and Permissions as well as Metadata Updates. Network Partition Resilient: Aim for AP in CAP Theorem. So our first problem is the main one we're talking about today: we have two separate Alfresco instances, they are regionally separate, as in on two different continents, and the connection at times could be unreliable. We also didn't want to sync everything across, only a folder or two, or maybe a full site, and not limited to documents but including data lists, forums and wikis. It should also sync permissions, document locks, and updates to the metadata, including adding and removing aspects and any custom metadata that's defined. If the network goes down, we don't want to prevent users from being able to edit documents in their relevant instance just because the other side is not available. This means we have to aim for an AP system in terms of the CAP theorem, which is Availability and Partition tolerance.
4 Geospatial Content Synchronisation: Proprietary Oracle DB w/ File system content. Custom Search Schema Required (incl. Geospatial Search) for Public Facing Website. Daily Synchronisation. An earlier project we did involved migrating data from a legacy Oracle database into Alfresco. This database stored the metadata, with the content on the file system in a specific directory structure. From there we also wanted to sync out to a publicly accessible Solr instance for public data. Now, I have specified in this simple diagram that it is external to Alfresco's own Solr instance. There are two main reasons for this: the first is that we wanted a custom schema that doesn't align with the schema that Alfresco's Solr supports. The other reason was geospatial data: some of the nodes had polygon extents attached to them, so you can do area selections when searching.
5 Alfresco Sirsi Dynix Synchronisation: Sync Nodes with Specific Aspects to Sirsi Dynix for Cataloguing. Translate Alfresco Content Model into Marc21 Fields. Report back any Sync-Related Errors and Update Reference. The third scenario we'd like to address is a synchronisation between Alfresco and Sirsi Dynix, which is a library cataloguing system. There is a record within Alfresco that should be synced into Sirsi Dynix, but with a conversion to the Marc21 metadata format. Marc21 is interesting in that it has this concept of subfields, so you have one field which may have one or more subfields associated with it. We also want to sync back any errors in case there are issues with the sync process or the other end reports an issue.
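To make the subfield idea concrete, here is a minimal sketch of what a property-to-Marc21 translation might look like. The class and method names are hypothetical, and while the tags used (245$a for title, 520$a for a summary) follow the standard Marc21 bibliographic format, this is not the actual Parashift mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a Marc21 field is a numeric tag plus one or
// more single-character coded subfields.
public class Marc21Sketch {

    // Map an Alfresco-style property map onto Marc21 fields. The outer map
    // key is the Marc21 tag, the inner map key is the subfield code.
    public static Map<String, Map<Character, String>> toMarc21(Map<String, String> props) {
        Map<String, Map<Character, String>> fields = new LinkedHashMap<>();
        if (props.containsKey("cm_title")) {
            // 245 = Title Statement, subfield a = title proper
            fields.computeIfAbsent("245", k -> new LinkedHashMap<>())
                  .put('a', props.get("cm_title"));
        }
        if (props.containsKey("cm_description")) {
            // 520 = Summary note, subfield a = summary text
            fields.computeIfAbsent("520", k -> new LinkedHashMap<>())
                  .put('a', props.get("cm_description"));
        }
        return fields;
    }
}
```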
6 Apache Camel: Open Source EIP Framework. So what could we use to address all of these scenarios in a similar fashion, exposing only the configuration bits we need, rather than having to develop bespoke solutions each time? Can we decouple any of the components and rearrange them as we see fit? To represent these integrations, we need a language to talk about them. Like with BPMN for business processes, integration has its own design standard: the Enterprise Integration Patterns. All of the scenarios I described above can be represented in EIP quite easily, which itself gives us a great start to talk and reason about the solutions. Ideally though, we'd like to take the EIP routes we have defined and apply them somehow. We need an Activiti for EIP. Luckily, there are a couple of frameworks out there that implement EIP. The two main frameworks that come to mind are Mule ESB and Apache Camel. We investigated both, and while Mule ESB has its merits and is a very viable candidate, I happened to be more productive with Camel due to its lightweight nature.
7 Apache Camel: Open Source Enterprise Integration Pattern Framework (Not an ESB). 100+ Components (File, JDBC, CMIS, REST, JMS, etc.). Multiple Route DSLs (XML, Java, Groovy, Kotlin). Custom Components + Beans. Open Source (Apache 2.0 License). So what is Apache Camel? Well, at its core it's a library that implements Enterprise Integration Patterns. By itself it's not a fully-fledged Enterprise Service Bus, but by adding a couple of extra bits we get part of the way there. Camel defines routes between components. A component is a connection into an external system, and is normally either a consumer of information, that is, one that pulls information down from a system, or a producer of information, which sends updates to a remote system. Components are wired together and defined within a route, which describes the flow of information. So you can say "from this file system, pull down all the files, and then put them in a message queue". When a route is in action, an exchange is the discrete body of work, which includes headers and a body. So if we're using the file system example, each file becomes an exchange, with the body equal to the file's content as an input stream, and the headers of the exchange including path, file name, modification date, and any other metadata. Routes can get a bit more tricky than just a simple from and to, and can include things like splitting and merging, filtering, transformation, marshalling and a number of other operations. For this to be versatile enough, Camel provides a number of DSLs for defining routes: Java, XML, Kotlin, Groovy, all of which allow you to express your routes in a meaningful manner. Most of the time the components in Camel will be sufficient to cover your use cases. CMIS, REST, JDBC are all supported. But there are times when you require more elaborate customisations, so Camel does allow you to easily define and develop your own components. Lastly, Camel is licensed under the Apache 2.0 license, which allows a lot of freedom in how you use the library.
8 Apache Camel – Recommended Stack: Apache Karaf (OSGi Container), Hawtio (Web Console), Blueprint (OSGi DI Framework). Install using the Karaf CLI:
feature:repo-add camel
feature:repo-add hawtio
feature:install camel
feature:install camel-core
feature:install camel-blueprint
feature:install hawtio
So with all of this ability to use Camel in a variety of different ways, and with a variety of different DSLs and runtimes, what would be a good recommendation to get started? Well, with all of the scenarios we've been through, we have found that the following stack for running Apache Camel works well. This is not the only stack you can use of course, as Camel is quite flexible, but this stack does come with some added benefits. Firstly, we use Apache Karaf or Apache ServiceMix as an OSGi container. ServiceMix and Karaf are very similar, but ServiceMix is more batteries-included. It may include things you don't want though, so my preference is to use Karaf and install only what you need. Within Karaf we want to use hawtio as a web console, which allows us to monitor and live-update routes within Camel. We can also use this to view logs and otherwise monitor how Karaf and Camel are going. You don't have to use a web console for this information; Karaf comes with a command-line shell, so you can execute commands accordingly. For the Route DSL, because we're using an OSGi container, we'll use Blueprint. For those familiar with Spring XML, Blueprint is highly similar, but as we're using OSGi, we can hot-load routes as we go just by dropping an XML file in Karaf's deploy directory. When you download Karaf, you can use the commands above to easily install Camel and hawtio.
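As an example of what gets dropped into the Karaf deploy directory, here is a minimal Blueprint file defining a trivial route. The endpoint URIs here are illustrative only:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">
  <camelContext xmlns="http://camel.apache.org/schema/blueprint">
    <route id="file-to-log">
      <!-- Poll a directory and log each file that is picked up -->
      <from uri="file:/tmp/inbox"/>
      <to uri="log:demo?level=INFO"/>
    </route>
  </camelContext>
</blueprint>
```

Dropping this XML into Karaf's deploy folder activates the route; editing or removing the file hot-updates it.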
9 Camel Routes: Route Configurations. So in our main scenario, which is Alfresco two-way synchronisation, what does our route look like, and how do we deploy it?
10 Apache Camel – Two Way Route: Drop a Blueprint XML file into the Karaf Deploy Folder. Poll and Consume Events from Alfresco Remote Instance. Limit to specific Sites or Paths. Prevent a Feedback Loop of Events. Submit to Alfresco Local Instance. Deployed to Both Sides. Quite simply, we want to subscribe to, or consume, information from Alfresco, do a bit of filtering, and then publish, or produce, changes. We can use Blueprint to drop this route into the Karaf deploy directory, which activates the route. The route starts with consuming information from Alfresco, so that each node is turned into an exchange within Camel. We then want to do a site filter, so we only consume nodes we're interested in. The next step is to filter out any feedback loops, which we'll touch on a bit later, and lastly to submit any updates to Alfresco. This route is deployed to both instances, so there are always two routes active at any given time. It's possible to have only one instance of Karaf, but to make it more robust, we use two. So which component do we want to use to sync? We have about 100 available; which one is suitable?
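A Blueprint sketch of what this two-way route could look like. The alfstream: URI scheme and its options, the site name, and the feedbackLoopFilter bean are all assumptions for illustration, since the exact component syntax isn't spelled out here:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">
  <camelContext xmlns="http://camel.apache.org/schema/blueprint">
    <route id="alfresco-two-way-sync">
      <!-- Consume node events from the remote Alfresco instance -->
      <from uri="alfstream:http://remote-alfresco:8080/alfresco?pollDelay=5000"/>
      <!-- Site filter: only pass events for nodes we're interested in -->
      <filter>
        <simple>${header.Site} == 'my-synced-site'</simple>
        <!-- Drop echoed events we have already processed, breaking the feedback loop -->
        <filter>
          <method ref="feedbackLoopFilter" method="isNew"/>
          <!-- Produce the change into the local Alfresco instance -->
          <to uri="alfstream:http://localhost:8080/alfresco"/>
        </filter>
      </filter>
    </route>
  </camelContext>
</blueprint>
```

The same file is deployed on both sides, with the remote and local endpoints swapped.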
11 AlfStream: Alfresco Camel Component. The most obvious candidate is CMIS, and you can definitely get a rudimentary synchronisation using the CMIS protocol. But we found that there are a few limitations with using CMIS that may or may not be a show stopper for your own use case. What we did instead was create a custom Camel component, which we have named AlfStream.
12 AlfStream – Alfresco Camel Component: Event Sourcing: Treats Alfresco as a Sequence of Events in an Event Log. Uses Transaction IDs for Tracking and Pagination: no ACL check limitations and no reliance on time. Retroactively applied: does not rely on the Audit Service. RESTful Endpoints: JSON for Consumer, Multipart for Producer. Idempotent: facilities for handling duplicate events. Potential to expand to other frameworks such as Mule ESB or Standalone. AlfStream is a Camel component we have been developing over time which integrates nicely with Alfresco and the types of usage scenarios we've mentioned. One of the main design decisions was to treat Alfresco as an event stream, applying the event sourcing design pattern that is starting to dominate a lot of big-data-type architectures. We treat creations and updates as the same event type, upsert, and handle delete events as well. This is one limitation with the existing CMIS component: handling deletions of content. Rather than rely on modification date to track changes, we use the raw transaction ID a node is associated with. In that way we avoid the common pitfall of missing nodes if the modification date is skewed. It's retroactively applied, and doesn't rely on the Audit Service, which can get quite big if you have had an instance running for a long time. Our endpoints are RESTful in the true sense of the word, in that the same request will produce the same outcome. This tends towards idempotence as a quality: if duplicate nodes come through, it doesn't really matter, as we'll operate on them in the same way. And because our endpoints are RESTful, they're quite easy to reuse elsewhere in other systems like Mule ESB, or even as a standalone app.
13 AlfStream Consumer – Alfresco Repo AMP. RESTful Repo-End Webscript:
maxResults: max number of results to get back per call (500 by default)
fromTxnId: beginning transaction ID
toTxnId: ending transaction ID (uses last transaction ID from current time if not set)
fromNodeId: for pagination within a transaction range if there are more than 500 entries
Array of JSON NodeEvents (using GSON):
[{
  "nodeRef": "91e4b557-20a ca3-285d31a323d8",
  "properties": {
    "cm_created": " T02:21:28.823Z",
    "cm_title": "Data Dictionary",
    "imap_maxUid": 0,
    "cm_description": "User managed definitions",
    "app_icon": "space-icon-default",
    "cm_creator": "System",
    "sys_node-uuid": "91e4b557-20a ca3-285d31a323d8",
    "cm_name": "Data Dictionary",
    "sys_store-protocol": "workspace",
    "sys_store-identifier": "SpacesStore",
    "sys_node-dbid": 14,
    "sys_locale": "en_US",
    "cm_modifier": "admin",
    "cm_modified": " T07:05:46.313Z",
    "imap_changeToken": "0a7a199a-2d1a-4fd1-b04c-7ef39fc9b35d"
  },
  "eventType": "UPSERT",
  "type": "cm_folder",
  "path": "/Company Home"
}]
So how does it work? Well, first off we have an Alfresco AMP that is installed on the repo side. This presents an array of JSON objects that display the properties of each node, alongside some pagination parameters so we can iterate through a range of transactions.
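To illustrate how a client might page through this webscript using those parameters, here is a small sketch of the cursor logic in plain Java, with the HTTP call replaced by an in-memory stub. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of client-side paging over the webscript's fromTxnId / toTxnId /
// fromNodeId / maxResults parameters. fetchPage() stubs the HTTP GET with an
// in-memory list ordered by (txnId, nodeId), as the webscript would return it.
public class TxnPager {
    public record NodeEvent(long txnId, long nodeId) {}

    // Page through all events in the inclusive [fromTxnId, toTxnId] range,
    // maxResults at a time, using the last node id seen for pagination
    // within a transaction (the fromNodeId parameter).
    public static List<NodeEvent> consumeAll(List<NodeEvent> repo, long fromTxnId,
                                             long toTxnId, int maxResults) {
        List<NodeEvent> out = new ArrayList<>();
        long txnCursor = fromTxnId;
        long nodeCursor = -1; // -1 means "start of the first transaction"
        while (true) {
            List<NodeEvent> page = fetchPage(repo, txnCursor, toTxnId, nodeCursor, maxResults);
            out.addAll(page);
            if (page.size() < maxResults) break; // short page: nothing left
            NodeEvent last = page.get(page.size() - 1);
            txnCursor = last.txnId();
            nodeCursor = last.nodeId();
        }
        return out;
    }

    // Stub for the repo webscript: up to maxResults events after the cursor.
    static List<NodeEvent> fetchPage(List<NodeEvent> repo, long fromTxnId,
                                     long toTxnId, long fromNodeId, int maxResults) {
        List<NodeEvent> page = new ArrayList<>();
        for (NodeEvent e : repo) {
            boolean afterCursor = e.txnId() > fromTxnId
                    || (e.txnId() == fromTxnId && e.nodeId() > fromNodeId);
            if (afterCursor && e.txnId() <= toTxnId) page.add(e);
            if (page.size() == maxResults) break;
        }
        return page;
    }
}
```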
14 AlfStream Consumer – Camel Component. Polls Repo Webscript. Keeps track of the current Transaction ID. Converts NodeEvents into Camel Exchanges: Exchange Headers include Node Metadata; Exchange Body is Content InputStream. Example exchange headers:
app_icon = space-icon-default
Aspects = [cm_titled, cm_auditable, sys_referenceable, sys_localized, app_uifacets]
Associations = []
AssocType = sys_children
breadcrumbId = ID-demo
cm_created = T07:49:30.593Z
cm_creator = System
cm_description = The company root space
cm_modified = T07:49:38.096Z
cm_modifier = System
cm_name = Company Home
cm_title = Company Home
InheritPermissions = false
NodeEventType = UPSERT
NodeRef = 814a8066-6acd-44c8-a2e5-08ac d
Path =
PermissionHash = ab54c3154b40bb5b741d4fd8ae0ca32370daf454
PropertyHash = d7152e8d2455a03a321ee45ee9dd2e0f
SecondaryParentAssociations = []
SetPermissions = [{"permission":"Consumer","accessStatus":"ALLOWED","authority":"GROUP_EVERYONE","authorityType":"EVERYONE","position":0}]
Site = null
sys_node-dbid = 13.0
sys_node-uuid = 814a8066-6acd-44c8-a2e5-08ac d
sys_store-identifier = SpacesStore
sys_store-protocol = workspace
Type = cm_folder
On the other side, the Camel component consumes the JSON endpoint and converts each node into an exchange. The component is also responsible for tracking the current transaction block, so it knows where to pick up each time it polls the remote end. The exchange headers include all the properties of the node. The exchange body is an InputStream in Java terms, which is the content of the node. This InputStream is lazy though: it does not open a connection to retrieve content until it is read, so you will often not need to read the InputStream at all.
15 AlfStream Producer – Camel Component. Converts Exchange to Multipart Form POST Submission. (Optional) Checks to see whether the Node exists first by using Property and Permission Checksums. Uploads Exchange Body as Content Data if Present. Not Limited to the AlfStream Consumer – Can use any Camel Exchange Type (such as the File Consumer). We also have a producer, which is responsible for submitting updates to Alfresco. This is a simple multipart/form-data request, but it does an optional check to see whether the remote end already has the node, as an optimisation to prevent large file uploads.
16 AlfStream Producer – Alfresco Repo AMP. Multipart Form Data interface for submitting Nodes to Alfresco. Ensures the Node's state is updated as per the Request. This includes changing (if necessary): Properties, Content, Permissions, Aspects, Peer and Parent Associations, Locks and Version Labels. For Properties: deserialise the form request, converting into QName and native Java type based upon the Content Model. For Content: update the cm:content property based upon the uploaded file. The last portion is the update webscript in Alfresco. I saw a discussion the other day around a "Fix Everything" design pattern: a big, encompassing function that is responsible for getting a system into the state you want. Idempotence once again lies at the heart of this type of design. We update properties, permissions, content, associations, locks and version labels if present, and get the node into the state we want it.
17 Environmental Challenges: Practice and Theory. So that sounds great. In theory. How does it work in practice, and what issues arose in implementing it?
18 User Configured Synchronisation. Challenge: Users should be able to add and remove folders from sync easily, without having to readjust the Camel Route each time. Solution: Create an Aspect that cascades down to child nodes on application. Adjust the route to only listen for nodes with that aspect. The first challenge we had was around users being able to select what nodes are synced and what aren't, sort of like cloud sync. We addressed this by creating an aspect that cascades down on folders, and then adjusting the Camel route to filter out all events that don't have that aspect applied.
19 Preventing a Feedback Loop. Challenge: When one Alfresco Instance is Updated, it generates an Exchange that the originating instance receives. This can cause an Infinite Feedback Loop. Solution: Skip Exchanges that have already been processed. Track equivalent Exchanges based upon Node UUID and Modification Time. The next challenge was around the two-way synchronisation. As an update in one system would cause the originating system to receive an event, we needed a way to stop this feedback loop, otherwise it would just go on forever. The initial approach was to keep a cache of already-processed nodes and their modification times.
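A minimal sketch of that cache, assuming equivalence is keyed on node UUID plus modification time. The class and method names are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the feedback-loop guard: remember (node UUID, modification time)
// pairs we have already produced, and skip the echo when the same pair comes
// back as a consumed event.
public class FeedbackLoopFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a given node/modified pair is offered,
    // false for the echoed duplicate.
    public boolean isNew(String nodeUuid, String modifiedAt) {
        return seen.add(nodeUuid + "|" + modifiedAt);
    }
}
```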
20 Updating Nodes. Challenge: Modification Time is not always updated when changes are made (e.g., when a Node is Locked, or ACLs are Updated). This causes some Exchanges to be ignored when they should be processed. Solution: Generate a Node SHA Hash for both Permissions and Properties for equivalence. As a default, use Modification Date, Lock Type and Version Label as inputs for the Property Hash (converting them to their byte values). This led to another issue. Sometimes nodes can change but their modification dates don't, such as when you lock a node. We could no longer trust this field to reflect all changes. The solution here was to create a hash of both the permissions and properties using SHA, and use that to decide whether it's a duplicate event or not.
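A rough sketch of how such a property hash could be computed in plain Java, digesting the default inputs the slide mentions. The field set, separator scheme and names are illustrative, not the exact AlfStream implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of a property hash: digest the fields that together reflect a
// change, so equivalence doesn't depend on the modification date alone.
public class NodeHash {
    public static String propertyHash(String modified, String lockType, String versionLabel) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String part : new String[] { modified, lockType, versionLabel }) {
                md.update((part == null ? "" : part).getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so adjacent fields can't collide
            }
            return HexFormat.of().formatHex(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available
        }
    }
}
```

Two instances can then compare hashes instead of raw fields: a lock appearing on one side changes the hash even though the modification date stays put.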
21 Permission Authorities. Challenge: Authorities may not exist on both instances. This means that the Permission Hash may not be equal on each instance. Solution: Generate an Authority within the Update script so that the permission hash is always equal. Authorities, if we want to ensure they are synced, needed to be created on the other end if they don't exist; otherwise our permission hash will be different on either side.
22 Permission Changes. Challenge: When you update the Permissions of a Node, this is not done within a Transaction: it is done within an ACL Change Set. This means that Exchanges aren't generated when the ACLs of a Node are changed. Solution: Track ACL Change Sets as well as Node Transactions, generating events if either one changes. The permissions of a node are not updated in transactions; they use ACL change sets. So we now need to track against both transactions and ACL change sets.
23 Version Numbers Sync. Challenge: When you receive an Exchange and update a node, the version number may be different at the other end (e.g., a Major Update instead of a Minor one). Solution: Adjust the Version Service to be able to provide the correct Version Label. There is no existing way in the Version Service to force a particular version number, so we had to create a way of allowing this to happen.
24 Restarting the Route. Challenge: When you restart the Camel Route, the AlfStream consumer will begin from the beginning. This can take a long time if there are 1000s of Nodes to process. Solution: Allow the AlfStream consumer to persist transaction IDs and change sets to a file, so it can pick up where it left off if it restarts. Sometimes you need to restart the route from the beginning; other times you want to pick up where you last left off. Previously, if you stopped the route it would just start from the beginning. Duplicate events are skipped though, so this doesn't cause any issues in terms of the state of the synchronisation, but it is time consuming checking every existing node over and over again. So we implemented a simple persist file that is read from and written to, which can be activated if need be.
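A minimal sketch of such a persist file, covering both cursors we track (transaction ID and ACL change set ID). The one-line "txnId,aclChangeSetId" format and the names are illustrative:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of persisting the consumer's position so a restarted route resumes
// where it left off instead of re-walking every transaction.
public class SyncPosition {
    public static void save(Path file, long txnId, long aclChangeSetId) {
        try {
            Files.writeString(file, txnId + "," + aclChangeSetId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {txnId, aclChangeSetId}, or {0, 0} when no position was saved yet
    // (i.e. start from the beginning).
    public static long[] load(Path file) {
        try {
            if (!Files.exists(file)) return new long[] { 0, 0 };
            String[] parts = Files.readString(file).trim().split(",");
            return new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```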
25 Quick Demo. So those were some of the challenges we ran into when providing a two-way sync of Alfresco using Apache Camel. I'm glad to say that we have done multiple deployments of this solution, and reused AlfStream in a variety of different ways that I haven't had time to mention today. Where are we going from here, and what is the next step?
26 Looking Ahead: Changes and Updates to AlfStream
27 Full Site Synchronisation. Challenge: Sites in Alfresco Share have cached configurations. This means that updating a site at the repo end does not reflect the changes on the front end. Solution: Force Share to reset its cache when changes to the dashboard configuration take place. Firstly, we'd like to be able to synchronise full sites. At the moment AlfStream can handle the node types and data which would allow a full site synchronisation, but Alfresco Share caches the site configuration, so we need a way of updating Share when changes are made.
28 Transaction Level Exchanges. Challenge: Groups of nodes need to be updated atomically within the same exchange. This prevents things like Folder Rules from syncing correctly. Solution: Allow the consumer and producer to handle and update multiple nodes within the same transaction block. At the moment it is one node per exchange, which can cause issues for entangled nodes such as folder rules, which actually span multiple nodes and have mandatory children within the content model. Although it is a bit of an edge case, I'd like to ensure that we can do transaction-level updates, updating all nodes at once, to allow support for this.
29 SaaS Storage Integrations. Lastly, we want to expand our configuration examples to use AlfStream with the popular SaaS storage offerings, such as those listed here. As we can already sub-select the nodes we want synced, it won't be too much of a stretch to configure Camel to integrate with these services.
30 Conclusion. In conclusion, we have learnt a lot of things from this particular use case.
31 Conclusion. Synchronisation between systems is a very common use case. Apache Camel provides a platform for creating Routes and Integrations, abstracting away common integration paradigms. Apache Karaf + Hawtio provides a base for managing Camel Routes and hot-deploying changes. Camel allowed us to create custom components to handle consuming from and producing to Alfresco, covering our existing and future use cases. Integration is always more challenging than you think! One takeaway is that synchronisation is a very common usage scenario that we have come across in a variety of forms. Apache Camel is a great platform for building integration patterns and customising them to suit your needs. Karaf + Hawtio gives the best of all worlds in terms of having an ESB-like solution. And lastly, integration is always more challenging than you think!
32 Speaker contacts. Website: https://www.parashift.com.au Github: https://github.com/cetra3/ I'll be around for the rest of BeeCon if anyone has any further questions about this solution, or if you want to talk about anything else that Parashift is doing. If not, please feel free to contact me if you want to discuss your own implementations or have any further questions.