We reached back to IBM Cloudant’s roots in physics research and Apache CouchDB, speaking with Zdenek Maxa, Senior Software Engineer at California Institute of Technology, to discuss his time at CERN and his decision to move his database from Oracle to CouchDB. It was a pleasure chatting with him. Zdenek had some interesting examples to share, and we hope you learn from his work translating Oracle schemas to JSON documents with binary attachments in CouchDB.
While he’s not a Cloudant customer, Zdenek’s work with CouchDB allowed physicists around the world to schedule grid-computing jobs for Large Hadron Collider data. Each job could process experiment data-sets several terabytes large. The application Zdenek oversaw handled more than 30,000 of these jobs in its lifetime and was responsible for managing the metadata associated with executing job requests on the grid.
In the end, Zdenek had a single-server CouchDB installation on the order of hundreds of megabytes of metadata stored. But he inherited a back-end architecture that relied on both Oracle and CouchDB in production. He was tasked with simplifying this architecture. The bulk of the work was writing Python utilities to retrieve data from Oracle and CouchDB, compare it, and store the new fields in CouchDB. Writing this code took one-and-a-half weeks, and fixing issues in the legacy application that Zdenek managed took several more. Read on!
Cloudant: How did you get started at CERN?
Zdenek: I started at CERN almost 10 years ago, as an employee of University College London. Before working for the Compact Muon Solenoid Experiment (CMS) at CERN, I was working on a competing CERN experiment called ATLAS. Much like at the CMS group, we were working to verify the existence of the Higgs Boson particle. At ATLAS, my job was to write Java code that displayed the graphics for particle interactions. Then, four years ago, I joined the CMS, where I worked under the co-leadership of Simon Metson and Dave Evans on the experiment’s Data Management and Workflow Management (DMWM) project. My role at DMWM is to develop software with Python and CouchDB to manage processing job requests to run in the worldwide computing grid.
DMWM is a set of projects to define and implement workloads on the Worldwide LHC Computing Grid. Around the time that Simon, Dave, and others left the project, I became responsible for the DMWM application called the “request manager.” You can think of it as an entry point into the CMS distributed workload management service. Using the request manager, CMS data operators can specify the properties of computing jobs they want to process on the grid: where to run them, which jobs to run on which datasets, where to store the results, and so forth.
I should add that in the context of this discussion, a “request” is not a request to a Web server; it represents a job-processing entity stored in a database.
Cloudant: When did you realize the project wanted to move off of Oracle?
Zdenek: Historically, this project was backed up by both Oracle and CouchDB. I think it was mostly Simon who started introducing more CouchDB into the projects — not only into request manager, but also into the software agents that control the processing of jobs grid-side. This is how the entire DMWM CouchDB story began. We started storing more data in CouchDB, but Oracle was never removed. Unfortunately for request manager, the application was depending on both databases.
Architecturally, it’s a shortcoming of design to have two database back ends and not be cautious about keeping both consistent with each other. This was my major headache. When I became responsible for request manager and assessed the inherent technical debt that had accumulated, I decided to start over and re-implement the application from scratch.
The first step was to drop Oracle from the picture and migrate the data over to CouchDB, but CouchDB didn’t have all the data. Some of the data items for the requests, the work identities, did not agree. My job was to dump the data from Oracle, which is easier said than done.
CERN’s Oracle instance was missing several useful tools. I have experience with MySQL and PostgreSQL and am used to having a dump utility, where you dump everything into the SQL commands and store the results as a flat text file. Then you can do whatever you want with it in Python. For various reasons specific to our project, nothing like that existed in the CERN installation of Oracle.
While I could have spent time installing and configuring the Oracle software we needed to complete this process, writing the tools myself was ultimately more instructive for my task. It took me a couple days to extract all the data using the Python Oracle libraries and export it to JSON. With the JSON and check utilities I wrote, I could cross-reference all the discrepancies between CouchDB and Oracle.
On a document level, it was quick to clean up. Some records only existed in one database and not the other. For example, CouchDB had documents that did not exist as Oracle records because of errors that occurred when queries were injected into the system. Going one level deeper, at the field level in each JSON document, I had to compare all the data in CouchDB and identify any inconsistencies with Oracle. Here, I copied all the fields missing in CouchDB from Oracle and into CouchDB.
Cloudant: How did you copy your Oracle schemas into CouchDB? Did you have to write custom MapReduce code?
Zdenek: No. Actually, we didn’t need to use MapReduce for the migration. From the CouchDB point of view, I was querying document-by-document. I wasn’t using MapReduce for that. I did a full database scan and retrieved everything from Oracle, and because of the way we planned our schemas, we had request ID fields that were common to both databases. So I could match each CouchDB document to its corresponding Oracle record. By default, CouchDB assigns its own unique document ID field, but in the case of request manager, we added arbitrary ID fields to each CouchDB document that stored the name of the request from our request manager application.
Cloudant: So this was like a giant meta JOIN on all records in Oracle and CouchDB?
Zdenek: Yes, in a sense.
When I was done, our new implementation, “request manager 2,” used only CouchDB on the back end, but it was shared with the legacy request manager application, which was still running in production. So it was necessary to resolve inconsistent data that arose in the legacy application in order to maintain a healthy CouchDB for when request manager 2 would be deployed. And that was the end of the move from Oracle to CouchDB.
In my opinion, there was no point running an application that depended on two different database back ends. It was a recipe for future headaches. Moving off of Oracle was the right thing to do.
Cloudant: What advantages were brought to the project by migrating to CouchDB?
Zdenek: I’d have to go back to Simon’s time, when he and Dave introduced CouchDB into the project, but it came down to storing lots of binary attachments that could be easily shared with other applications in the CMS. That was the main reason for the move.
Although we’re talking about a “request” as a job-processing specification for the CERN computing grid, the actual job specification data is represented in CouchDB as an attachment to each document. We call this data a “workload management spec file,” which is a serialized Python object instance (i.e., a pickle). Because many other CMS applications were already using CouchDB — or were planning to use it soon — it was very natural for request manager to store these spec files as binary attachments.
Of course you have a BLOB datatype in Oracle to store binary data, but I didn’t see any easy way to use it to share our data at large scale. From my experience, Oracle would become bloated and slow when working with tens of thousands of BLOBs.
Sharing data was our main goal, but CouchDB’s NoSQL approach also better suited our philosophy on data management. With each request represented as a separate JSON document, and the workload management spec file attached to it, CouchDB neatly compartmentalized our data.
The next thing is that, every so often, it was necessary to change the Oracle schema. For us, each change was troublesome and painful. The way we made applications at the CMS, we were used to producing three different flavors of each release: 1) development, 2), test bed and 3) production. Developers write all their code in their configuration-managed private VM that simulates production. The test bed is used when developers need the CMS data operators to test a change to the system. Finally, the software is released into production.
So assume you want to change the Oracle schema. You would have to go through and modify the schema in each Oracle instance that serves as the back end for that particular deployment flavor. That’s what we did, three times, for every environment of each release. In CouchDB, you simply check to see if a certain field exists, and that’s it. There is no schema to enforce.
Cloudant: Why did you have to keep changing your Oracle schema?
Zdenek: One example would be when a CMS data operator wants to implement a new data type for a processing job. They require some specific attributes, and these attributes need to be captured by extra columns in Oracle. This example is just one feature request implementation scenario that led to schema changes. Another example is a schema clean-up when removing or modifying a feature directly dependent on the data layer.
Cloudant: Did you have to make any code changes?
Zdenek: While we had to make modifications, these were mostly removing code that was no longer needed rather than writing new code. As a matter of fact, CouchDB led to great simplifications because all the data access objects (DAOs) needed for various Oracle tables were gone — which was a lot of code. We could get everything we needed directly from CouchDB.
Cloudant: Thank you!
Zdenek actually opened this interview by wishing me a happy birthday on Skype, so it was a good day all around.
This summer, Zdenek moved on from CERN, but development work continues at the CMS using the groundwork he established with CouchDB. Zdenek also told me that although he ran CouchDB on a single server for request manager, with a future release of Apache CouchDB™ slated to include Cloudant’s BigCouch clustering capabilities, the CMS is considering validating future versions that could run CouchDB in a clustered environment, so other applications in the CMS Web Group can more easily share the infrastructure.
A big thank you to Zdenek for spending time with us, and good luck with what’s next!