
Spring Batch Deployment Example

Deployment has always been a tricky part of batch processing. Unlike a web application, there is no standardized deployment model, and there is a wide variety of environments you could deploy to. You could deploy a batch application within a web container, or as a standalone Java app started by one of many available schedulers. Because of this, everyone's environment is different, and no single example can serve as a universal starting point. However, I have done enough standalone deployments for Linux using Bash to share a simple example.

The job itself matters little for the purposes of this post. A simple one-tasklet job suffices:

   <job id="sampleJob" job-repository="jobRepository">
<step id="simpleStep">
<tasklet ref="tasklet" />
</step>
</job>

<beans:bean id="jobRepository"
class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean"/>

<beans:bean id="transactionManager"
class="org.springframework.batch.support.transaction.ResourcelessTransactionManager" />

<beans:bean id="jobLauncher"
class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
<beans:property name="jobRepository" ref="jobRepository" />
</beans:bean>

<beans:bean id="tasklet" class="net.lucasward.sample.SampleTasklet" />


(namespace removed for readability)

The tasklet simply prints 'Hello World':

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class SampleTasklet implements Tasklet {

    public RepeatStatus execute(StepContribution contribution,
            ChunkContext chunkContext) throws Exception {
        System.out.println("Hello World");
        return RepeatStatus.FINISHED;
    }
}


Simple enough, right? Now for the hard part. If we want a scheduler to be able to launch this from the command line, what do we do? The first problem to solve is: what does our deployment look like? In my mind, there are three necessary components:


  1. The jars themselves, which need to be on the classpath
  2. A script that the scheduler can call
  3. The XML and/or .properties files that will be used for configuration


Personally, I prefer to separate the XML files from the jars, to allow for tweaks to the job. This is especially useful for Spring Batch job definitions, though it may be less so for normal Spring configuration files. My preferred layout is three directories: bin, lib, and resources. The scripts go in /bin, jars in /lib, and XML/properties files in /resources. It doesn't really matter how you break yours up, but it's the format I'll be using. In order to create this layout, I'll use Maven and the assembly plugin:


<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>2.2-beta-2</version>
    <configuration>
        <descriptors>
            <descriptor>src/main/assembly/descriptor.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <id>make-distribution</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>


And the Descriptor:

<assembly>
    <id>distribution</id>
    <formats>
        <format>tar.gz</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <fileSets>
        <fileSet>
            <directory>src/main/scripts</directory>
            <outputDirectory>bin</outputDirectory>
            <useDefaultExcludes>true</useDefaultExcludes>
        </fileSet>
        <fileSet>
            <directory>src/main/resources</directory>
            <outputDirectory>resources</outputDirectory>
            <useDefaultExcludes>true</useDefaultExcludes>
            <filtered>true</filtered>
        </fileSet>
    </fileSets>

    <dependencySets>
        <dependencySet>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
    </dependencySets>
</assembly>


The format I'm using is tar.gz, since I'm targeting Linux, but there are many more available.

The two fileSet entries describe the bin and resources directories. (I haven't talked about the script yet, but I will below.) The dependencySet entry transfers all the dependencies that Maven is managing into the lib directory, including the created jar itself. When you run 'mvn install', two artifacts are created: the normal jar, and another file with the same name ending in -distribution.tar.gz. In the case of my example that is: batch-deploy-sample-1.0-SNAPSHOT.jar and batch-deploy-sample-1.0-SNAPSHOT-distribution.tar.gz. Unpacking the archive gave me the following:

./bin:
sampleJob.sh

./lib:
aopalliance-1.0.jar spring-batch-core-2.1.1.RELEASE.jar spring-tx-2.5.6.jar
batch-deploy-sample-1.0-SNAPSHOT.jar spring-batch-infrastructure-2.1.1.RELEASE.jar stax-1.2.0.jar
commons-logging-1.1.1.jar spring-beans-2.5.6.jar stax-api-1.0.1.jar
jettison-1.1.jar spring-context-2.5.6.jar xpp3_min-1.1.4c.jar
spring-aop-2.5.6.jar spring-core-2.5.6.jar xstream-1.3.jar

./resources:
jobs

./resources/jobs:
sampleJob.xml


All we need now is a simple script to actually run the job:

#!/bin/bash

CP=resources/

LIB=lib/*
for f in $LIB
do
    CP=$CP:$f
done

java -cp $CP org.springframework.batch.core.launch.support.CommandLineJobRunner \
    jobs/sampleJob.xml sampleJob


I'm probably not going to win any awards for my Bash scripting skills anytime soon, but it gets the job done and isn't quite as cryptic as a more concise version would be. Essentially, I'm building a classpath from all the jar files in /lib by looping through the files and separating them with a colon. Once that is done, I can start a Java process, using CommandLineJobRunner as the main class. As described in the documentation, all I need to pass to the job runner is the XML file that defines the job, and the job name. (It's worth noting that you would normally also need JobParameters, but since I'm using the map-based job repository, they aren't necessary.)
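For reference, here is a minimal sketch of what a programmatic launch would look like, which is roughly what CommandLineJobRunner does under the covers. It assumes the jobLauncher and sampleJob beans from the configuration above; the main class name and the run.id parameter are purely illustrative and not part of the sample project:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class SampleJobMain {

    public static void main(String[] args) throws Exception {
        // Load the same XML the script passes to CommandLineJobRunner
        ApplicationContext context =
                new ClassPathXmlApplicationContext("jobs/sampleJob.xml");

        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("sampleJob");

        // With a persistent job repository these would need to be unique per run
        JobParameters parameters = new JobParametersBuilder()
                .addString("run.id", "1")
                .toJobParameters();

        JobExecution execution = jobLauncher.run(job, parameters);
        System.out.println("Exit status: " + execution.getExitStatus());
    }
}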

You can download the running example from my github account: batch-deploy-sample.

Comments

Lucas Ward
Interesting. I'll definitely add your code for the script directory. I'm not a big fan of using the CLASSPATH variable, so I think I'll stick to just passing in the classpath to the command line via -cp
Philippe
I slightly corrected the shell script so that it can be launched from anywhere, rather than having to cd to the bin directory.
Here is what I did :

# Set the classpath
SCRIPTDIR="$( cd "$( dirname "$0" )" && pwd )"
CLASSPATH=.

for file in `ls $SCRIPTDIR/../lib`
do
export CLASSPATH=$CLASSPATH:$SCRIPTDIR/../lib/$file
done
export CLASSPATH=$CLASSPATH:$SCRIPTDIR:$SCRIPTDIR/../resources

# Launch the conversion job
java org.springframework.batch.core.launch.support.CommandLineJobRunner …
Philippe
Thanks a lot ! It's exactly what I was looking for and it works like a charm…
Philippe
This comment has been removed by the author.
Dil………………………………………:)
I want to stop the running spring batch job, Please let me know how can i do that .. i wil b follooing u r blog fr ans
Lucas Ward
I'm sorry for the crazy late response, I've been extremely busy with client work for the last few weeks. Are you still having this issue?
David Tam
Thanks, this is very useful for someone totally new to Spring Batch, but when I run the job.. I get this..

ClassPathXmlApplicationContext [INFO] Refreshing org.springframework.context.support.ClassPathXmlApplicationContext@4ca31e1b: display name [org.springframework.context.support.ClassPathXmlApplicationContext@4ca31e1b]; startup date [Fri Oct 22 10:28:52 CDT 2010]; root of context hierarchy
2010-10-22 10:28:52 XmlBeanDefinitionReader [INFO] Loading XML bean definitions from class path resource [jobs/sampleJob.xml]
2010-10-22 10:28:53 CommandLineJobRunner [ERROR] Job Terminated in error:
org.springframework.beans.factory.BeanDefinitionStoreException: Unexpected exception parsing XML document from class path resource [jobs/sampleJob.xml]; nested exception is java.lang.IllegalArgumentException: 'beanName' must not be empty

Its probably something simple I'm missing.

Test Code Quality

I've been meaning to write about this for a while, but it always seems so contentious for some reason. I'm sure we've all seen the worst examples before. I once opened up a 5,000-line unit test of a Spring MVC controller. It got this bad because every bit of setup was cut and pasted into each new test function. Of course, it should have been a sign to the original author that refactoring was needed when there was that much setup required in the first place. But it still raises the question: what level of quality should be expected of test code?

I've worked with lots of great developers who would see so much as a line of duplication and take it upon themselves to drop everything, including the donut in their hand, to remove it from the code. However, when it comes to test code, those same developers would shrug their shoulders and say: 'It's only test code.' I just don't get this. If a lack of reuse and duplicated code would cause an aneurysm in production code, why is it acceptable in test code? For that matter, why would the way you treat test code be any different at all?

When we write unit tests, it's code we intend to last just as long as the class it's testing, or at least until the assumptions we made about the code have changed. And given how slowly some companies change even those basic assumptions, that could mean never. Test code still needs to be easy to maintain and easy to understand.

Here’s some of the reasoning I’ve heard:

"I don't want to make the unit test complicated; it should be easy to understand the intention."


If abstractions and code reuse done properly can make code easier to understand, why would the same not apply to test code? How would duplication between two unit test classes aid in the clarity of the individual test?

"Unit tests should be completely isolated from each other."


I'll agree to this in principle, but I think it can be taken too far. Unit tests should run free of side effects. You should be able to run each individual test by itself, or with any number of other unit tests, without the existence or absence of those tests causing failures. However, this has nothing to do with sharing abstractions, or reusing the same setup data when more than one class under test uses the same domain objects. As long as you're creating new instances in your test classes, and not accessing them through stateful static calls, there should be no issue.

"Modifying one test to pass shouldn't cause another to break."


This is a tricky one that has bitten me numerous times before. If you create some kind of abstraction, such as a builder that creates test data for multiple tests, you run the risk of breaking a lot of tests when you change something in the builder to make one pass. It can be extremely disheartening to watch the unit test you were working with pass, only to see hundreds of test failures when you run the entire build.

However, despite the pain this has caused me in the past, doesn't this seem like a normal coding problem? Maybe it's coupling/cohesion, or the Law of Demeter, or any other random law. There's probably a perfectly well-known refactoring to make this test code easier to modify without side effects. There is no reason not to apply it to your unit test code as well.
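To make the builder idea concrete, here is a minimal sketch of the kind of test data builder I mean. The Customer type and its fields are hypothetical, purely for illustration:

// Customer is a hypothetical domain object used only to illustrate the pattern.
public class CustomerBuilder {

    private String name = "Default Name";
    private String country = "US";
    private boolean active = true;

    public CustomerBuilder withName(String name) {
        this.name = name;
        return this;
    }

    public CustomerBuilder withCountry(String country) {
        this.country = country;
        return this;
    }

    public CustomerBuilder inactive() {
        this.active = false;
        return this;
    }

    public Customer build() {
        return new Customer(name, country, active);
    }
}

Each test then overrides only the fields it actually cares about (for example, new CustomerBuilder().inactive().build()), so changing a default value in the builder is far less likely to ripple through hundreds of unrelated tests.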

Now, though, comes the dirty part, one that only an 'Enterprise' developer will know the unfortunate reality of. What if your test code looks nothing like your real code, but there's nothing really wrong with it except that it breaks many of the largely nonsensical 'coding standards' put in place by some type of 'governing body'? What if, *gasp*, it's better as a result?

Comments

Dokemion
When you are trying to learn java you have to under go training.

Boiler Plate Code and Closures in Java

While working on a project recently, I ran into the following code:

public SomethingFromDatabase getSomethingFromDatabase() {
    Element element = getDataFromCache();
    return (SomethingFromDatabase) element.getValue();
}

private Element getDataFromCache() {
    Element element = cache.get(KEY);
    if (element == null) {
        element = new Element(KEY, dao.getSomethingFromDatabase());
        cache.put(element);
    }
    return element;
}
Assume cache is an instance of net.sf.ehcache.Ehcache and the dao is your run-of-the-mill DAO.

There are some obvious problems with this code; the one that stood out most to me is the boilerplate. The solution I used, which is a common one, is a closure:

public <T> T get(String cacheKey, CacheDataClosure<T> closure) {
    Element element = cache.get(cacheKey);
    @SuppressWarnings("unchecked")
    T cachedObject = element == null ? null : (T) element.getValue();
    if (cachedObject == null) {
        cachedObject = closure.getData();
        cache.put(new Element(cacheKey, cachedObject));
    }
    return cachedObject;
}

public interface CacheDataClosure<T> {
    T getData();
}
A similar approach is used by Spring in its *Template classes. Refactoring the original code to use the facade would look something like this:

private Foo getDataFromCache() {
    return cacheFacade.get(KEY, new CacheDataClosure<Foo>() {
        public Foo getData() { return dao.getFooFromDatabase(); }
    });
}
I'm sure that to some people reading this blog, the solution seems obvious, and almost not worth mentioning. However, over the years I have seen many interesting solutions to the same problem, especially in the database world, but elsewhere as well. You would be surprised how many people solve this problem with inheritance by instinct. Even those who have used frameworks that require a closure in certain scenarios don't necessarily think about what it is they're using, and don't think to use it when a problem arises later. Even though closures are ugly in Java, they are still the best way to separate concerns and reduce boilerplate in a lot of scenarios.

Comments

Lucas Ward
Thanks for pointing that out. I hadn't seen it before. I'm still not sure that I get it though, as I don't see a way for someone to define how the cache is refreshed, which is the main problem I had with the original code. When the cache is stale, you have to populate it yourself, and it's all boilerplate. As a user of ehcache, all I really want to tell them is how to get my data. Unless I'm missing something when looking at the Javadoc? (Which is highly likely)

I do think from looking at it that I may need to go back and think about concurrency. I haven't looked at the ehcache implementation, but I assumed it blocked. I was simply refactoring the earlier boilerplate code.
David Dossot
You could also consider using EHCache's SelfPopulatingCache construct (granted it's cache specific while your approach is generic).

SCM Continued

In my last blog post I talked about a maturity model for source control systems. Martin Fowler has recently posted a blog entry that covers the same topic.


One of the reasons that Paul and I thought a maturity model for SCM was interesting was that it brought objectivity to the debate (or at least tried to). It seems far too subjective (and easy) to break SCM systems down into two categories: usable and unusable. For me personally, I would put ClearCase in the 'unusable' category. However, despite its shortcomings, thousands of programmers use it on a daily basis. Therefore, it can't be 'unusable'. What it boils down to is that for me, given what I find important when developing, ClearCase is a tool that I feel prevents me from coding effectively. If someone else doesn't hold those same values to be important, perhaps that changes what they view as usable.

Martin takes an interesting twist with his article, in that while he still uses two groups, his dividing line is 'recommendability'. This is still a somewhat subjective measurement, especially since his focus group was us ThoughtWorkers. But after reading the responses to my last blog post, a common theme seemed to come from those using ClearCase. They were willing to defend it as 'usable', but not a single one that I read would recommend it. This mirrors some of my project experience as well. I've even known some ClearCase admins who would argue with me at length about how ClearCase isn't that bad, as long as you 'used it right', and yet even they wouldn't recommend it as a source control product. Perhaps it's much more useful to ask a client whether they would recommend their SCM to others, rather than arguing about why you don't think it's the best solution. After all, if they wouldn't recommend it to others, why are they still using it?

A Maturity Model for Source Control (SCMM)

Most enterprise developers are already familiar with the Capability Maturity Model. Those of you with Agile tendencies may have also heard of the Agile Maturity Model. The purpose of these models is to objectively assess an organization’s maturity in a particular methodology. Despite any feelings you may have on CMM or waterfall in general, having an agreed upon way to assess the basic principles of a methodology can be quite useful. One area where this type of assessment typically falls under the radar is SCM. Many companies look at a particular SCM solution, and believe it covers every need. However, when objectively analyzing the various uses and features of SCM systems, a type of maturity model begins to emerge.

As with other maturity models, we can use a similar numbering system to delineate how 'mature' an organization or tool is, with a higher number being better. As you can imagine, zero corresponds to a complete lack of source control at all, or perhaps a shared file system and a roofing shingle. The highest level is reserved for the most advanced tools, such as Git or Mercurial, which allow for advanced SCM capabilities such as zero-cost branching and merge through rename.

In doing this analysis, some common feature-sets emerge:
  • Atomicity: If you check in 5 files, and there is an issue with one, none of the five should be committed.
  • Revision Tagging: The ability to take a snapshot of the state of your codebase at a particular moment in time, and refer back to it later.
  • Branching: The ability to have more than one version of the codebase in active development, separate from each other.
  • Merging: Taking different versions of the same codebase and combining those changes. This may mean changes from a local copy or between two branches, with a complete history of how these various pieces have been merged.
  • Merge through Rename: A more advanced SCM feature for dealing with merging files that have been renamed. Consider file A in branch 1 and 2 that is then renamed to B in branch 2, but not in 1. Then consider what would happen when A’s contents are changed in branch 1. SCM systems that support merge through rename allow for this and more, all others blow up.
  • Repository Navigation: The means of accessing the contents of the repository. Subversion, for example, has a web interface (among many other third-party tools), while ClearCase has thick-client access.
Lack of SCM Bravery

Most folks in enterprise computing, as of 2003, had experienced two SCM tools. It does not matter which two. The important point is that they remembered which of the two was the better one, and which was the OH-MY-GOD-I-NEVER-WANT-TO-USE-THAT-AGAIN one. They had taken that lesson, and were going to make damn sure they were never going to use the lame one again. Sadly, that meant they were risk averse with respect to a third or fourth choice. Perhaps also, as all the tool chains and workflows are different, there was caution based on that too.
Much has changed since then, with experience of half a dozen tools much more commonplace now. However, if you walk into most enterprises today, you're likely to see only one of two SCMs: ClearCase or Subversion. For those of us who have used many of the SCM systems below (especially us consultants), it is hard to understand why someone would continue to use one of the systems lower on the maturity model. It can be easy to think that it is imposed upon them by the powers that be; however, in many cases they will tell you at length why their SCM is better, or at least why they don't feel the need to switch. This stems from two fundamental issues:
  1. They're pervasive. Like an IDE, they're there every second of every day; everything we do as programmers short of documentation needs to be checked in. They have a huge impact on our day-to-day lives.
  2. Steep learning curve. Regardless of the solution, and any third party tools helping to make it easier, source control in and of itself is complex. Switching requires that you accept a drop in productivity, and no one wants to become a newb again.
Five Maturity Levels

Shadowing the CMM, we’re aiming at five levels above zero.

Level Zero - ‘No SCM’

No source control solution at all, or a shared file system with periodic backups. One developer, or a few at most, share source without tools, and as such run a number of risks:
  • Source may not be compilable at any moment
  • Source may be lost because of developer error
  • Extremely easy for developers to overwrite each other's changes.
There is nothing to redeem in this level, or build on for future ones.

Level One - ‘First Foray’
  • Developers have a workspace on the network and cannot work offline
  • Running a build means going to a long lunch
  • Refactoring - if it works at all, is deathly slow
  • Checkouts slow enough to be done overnight
  • Checkins slow
  • Non-atomic commits
  • Branching and tagging expensive
  • Personal/local branching means second checkout
  • Centralized rather than distributed
  • Unusable or slow merge point tracking
  • Merge through rename - merge, rename not understood by tool
  • Repository can corrupt on occasion, high administrator/expert to developer ratio (1:10)
Tools with basic ability to check out, version, and lock files. Usually implies developers are working on the same code. Synchronization to head may be problematic depending on the locked status of individual source files. Tools in this space may have scaling issues, and not work well over long distances. Renaming of resources may be hard to impossible. Branching and tagging may require permissions on triplicate stationery, and a slaughtered lamb or an incense stick or two (they take a while and eat disk).

Examples:

Visual Source Safe

The core of VSS’s problem is that it is not client/server and has not moved forward much in a number of years. Developers can hurt each other with locking files and even marking files as shared (an esoteric feature).

Clearcase’s Dynamic mode of operation

Though an enterprise tool that sold widely, it is like wading through molasses (treacle for Brits) to use. It mounts a network share or two for you, and that's what you edit on and compile against. Both IDEs and builds are slow as a consequence. Dynamic checkouts are also precarious, as you may have more than one developer active within them, and for periods of time you may observe the codeline to be non-compilable. Furthermore, it does not support atomic commits. Though its three-way merge is actually advanced, nothing else about ClearCase in dynamic mode (including UCM mode) makes you want to recommend it to anyone for any purpose. Merges from one active branch to another can actually take longer than the time spent making the original commits to the donor branch. There is an inverse-square law at work with these installs of ClearCase, in that capacity drops off rapidly as you add more staff and attempt to get busier (more throughput) with it. ClearCase sales folks and consultants recommend this mode of operation. Anomalous for ClearCase (dynamic mode) when rated against the bullets above is that it can merge through rename, which is the preserve of more advanced tools (see later).

Level Two - ‘Clunky’
  • Developers have local copies and can work offline
  • Local file system means fast(er) builds
  • Refactoring - will go through; make a cup of tea
  • Checkouts will trickle past as if in bullet time
  • Checkins potentially still slow
  • Non-atomic commits
  • Branching and tagging potentially expensive
  • Personal/local branching means second checkout
  • Centralized rather than distributed
  • Unusable or slow merge point tracking
  • Merge through rename not working, requires extensive follow up conflict resolution before commit
  • Repository can corrupt on occasion, administrator to developer ratio sub optimal (1:20)
Examples:

CVS

For the longest time, CVS was the default internet-community SCM server. It has a variety of network protocols, and runs at a predictable speed over an arbitrary distance. It uses optimistic locking (it kinda introduced this), and developers find working offline a breeze. Commits back can still be like pulling teeth, in that every directory in a source tree is analyzed for differences with the server version via a wire call (just a hash but even so). Sadly, branching and tagging are still costly. Some folks still prefer CVS over others listed below.

TFS

Microsoft has tried to make something in Perforce's image (it historically ran a private fork of Perforce from the '90s called Source Depot), but TFS does not have all of the features and performance of Perforce. Microsoft rammed in too many non-source-control features, and it falls short of the mark in terms of installation and administration costs. It's pretty much wedded to Windows developers, and correspondingly leverages a ton of other MS server-side pieces. Day-to-day integrated operations in Visual Studio are where you feel forcibly slowed down versus not using SCM. Agile folks approach checkout/checkin/conflict resolution with dread. This tool should be ranked higher based on features, but is let down by its implementation.

Clearcase static mode of operation

This is where ClearCase checks out a branch to your C:\ drive. Much like CVS, but a little faster. Builds run as fast as is possible on a local IDE drive (a gazillion times faster than an overloaded 10BASE-T Ethernet network with 50 devs on it). You're not going to get hosed by some other developer rendering the branch non-compilable (phew!), at least not unless they checked in that broken state. You don't automatically keep abreast of communal efforts - you have to sync/update periodically. ClearCase sales folks and consultants do not recommend this mode of operation, for some strange reason. Anomalous for ClearCase (static mode) when rated against the bullets above is that it can merge through rename, which is the preserve of more advanced tools (see later).

Level Three - ‘Basic’
  • Developers have local copies and can work offline
  • Local file system means fast builds
  • Refactoring - speedy; smile at your neighbors for a second
  • Checkouts might finish before you die of old age
  • Checkins very fast
  • Atomic Commits
  • Lightweight (cost free) tagging and branches
  • Rudimentary branching and merging
  • Personal/local branching means second checkout
  • Centralized rather than distributed
  • Rudimentary merge point tracking
  • Merge through rename not working, requires extensive follow up conflict resolution before commit
  • Repository can corrupt on occasion, low administrator to developer ratio (1:100)
Examples:

Subversion

This is now the de facto standard install for enterprises, and since 2003 it has been viable enough for the majority of open source SCM portals to install it, or at least have plans to. Compared to CVS it's a definite advance. Speed is much improved, both for normal checkin/checkout operations and for branching and tagging. It is a comfortable place for Agile developers to feel that their love for throughput is requited. Just don't try to do frantic parallel development on more than one branch.

Level Four - ‘Effective and Reliable’
  • Developers have local copies and can work offline
  • Local file system means fast builds.
  • Refactoring - speedy; smile at your neighbors for a second
  • Checkouts reasonably fast
  • Checkins reasonably fast
  • No-op sync/update very fast
  • Atomic Commits
  • Lightweight (cost free) tagging and branches
  • Advanced branching and merging
  • Personal/local branching means second checkout
  • Centralized rather than distributed
  • Sophisticated merge point tracking
  • Merge through rename only possible with configured branch mappings, otherwise a fix-up before commit is required
  • Repository corruptions very rare, very low administrator to developer ratio (1:1000)
Examples:

Perforce

Fast is what this ten-year-old tool was built to be. In some operations nothing beats it. Back in the day, Perforce heralded a number of firsts in terms of capability (atomic commits, lightweight branching/tagging, three-way merges).

This is bound to be controversial at ‘level 4’ as P4 uses read-only locks extensively.
Some Agileists are going to violently object, as this mandates good tool support (IntelliJ juggles the read-only flags via Perforce, but Vim does not). This severely limits the ability to work offline (making it anomalous versus the definition for level 4). You can still do it: you have to blast away the read-only flags, and do a revert-unchanged when you reconnect later, but don't try to do renames while offline - it'll end in tears. Also, refactorings (when there is tool support) will work, but will be slightly slower than for the likes of Subversion, as network IO happens per changed file.

Level Five - ‘Speedy, Invisible, and Highly Capable’
  • Developers have local copies and can work offline
  • Local file system means fast builds
  • Refactoring - speedy; crick your knuckles momentarily
  • Checkouts reasonably fast
  • Checkins reasonably fast
  • No-op sync/update very fast
  • Atomic Commits
  • Lightweight (cost free) tagging and branches
  • Advanced branching and merging
  • Highly efficient local/personal branches
  • Distributed rather than centralized, full audit on consumed contributions from distributed sources
  • Seamless merge through rename - no configuration needed
  • Sophisticated merge point tracking
  • Repository corruptions very rare, almost invisible administrator to developer ratio (1:10000)
Examples:

Git and Mercurial

These two are very similar, and both have their fans. It is difficult to tell which will win out over the other in time. For both, the killer capability for Agile folks who treat previously committed code like wet paint is merge through rename. It works so well that you feel this is how Fowler intended refactoring to feel, and that all other SCMs fall short of that vision. In short, Git and Mercurial have the speed of Perforce, plus easy local branching, plus distributed operation, plus merge through rename.

Level Minus One - ‘Death Wish’

There is a special place in hell for PVCS Dimensions. We don't know about the latest version, which is rumored to have atomic commits, but we're pretty sure that the turn-of-the-millennium version was the worst of ClearCase with a bucket full of suck thrown into the mix. For example, person A doing a sync/update in the morning would take 45 minutes, but if person B started their sync/update after person A, then 1.5 hours would be the reality (even if nothing changed). Sometimes things were so bad that an ad hoc distributed mode would leap into being (devs putting sets of changed files on network mounts, floppy disks, or email). As far as the authors can recall, there is a 100% correlation between PVCS Dimensions use and failed projects.

Anomalies

Subversion with Git as a front end. Git has built-in support for Subversion servers. Given some time, Git can clone a whole Subversion repository to a local workspace. From there you get the quick branch-juggling capability, as well as local commits that can be sent back to Subversion later via 'dcommit'. Speedy local branch juggling (a level 5 gain) is possible just as for Git proper. However, Git as a front end for Subversion cannot participate in Subversion merge point tracking (yet), thus we cannot back-implement another level 5 feature - merge through rename. Lastly, Git fronting Subversion is still mostly tied to a single server, so it cannot claim to be distributed in the textbook sense of the word. Where should it be placed? 3.5 or 4 perhaps in terms of maturity, but merge lets it down.


Seamless merge through rename is the high bar

You're most likely to be using ClearCase in the enterprise today, and may have heard that a shift to Subversion will be more productive (and a lot cheaper). Though Subversion is enterprise approved now and will take over from ClearCase as #1 soon enough, it is time to look for new tools. Git and Mercurial mark the high bar now in many ways. However, one feature stands out for Agile teams chasing high throughput - seamless merge through rename. If Perforce wants some of its old empire back, it is going to have to implement this feature, which could well be hard for them. The Subversion team already has it scheduled.

This blog post was the result of much discussion and pair blogging with Paul Hammant.

Comments

twic
For the sake of completeness, here's how Mercurial deals with Michele's third scenario. Here's the script:

echo VERSION:
hg --version | head -1

echo
echo SETUP:

rm -rf michele
hg init michele
cd michele

echo "one" >test
hg add test
hg commit -m "first commit"

hg branch nbr
hg mv test test_nbr
echo "two" >test_nbr
hg commit -m "renamed to test_nbr and edited"

hg update default
hg mv test test_default
echo "three" >test_default
hg commit -m "renamed to test_master and edited"

echo
echo MERGE:
hg merge nbr

echo
echo FILES:
grep . *

Here's the output:

VERSION:
Mercurial Distributed SCM (version 1.8.2)

SETUP:
marked working directory as branch nbr
1 files updated, 0 files merged, 1 files removed, 0 files unresolved

MERGE:
note: possible conflict - test was renamed multiple times to:
test_default
test_nbr
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)

FILES:
test_default:three
test_nbr:two

So, it does the same thing as Git, but it does at least warn you about it.
ian.ringrose
I have seen ClearCase dynamic views work very well and be very fast. However, they don't work well for .NET or Java development.
The power of ClearCase dynamic views is when they are combined with oMake for C/C++ development. oMake will avoid the call to the compiler if someone else has already done the compile with the same options and header file. The dynamic views allow oMake to totally track the dependencies of every object file – a great benefit when using C/C++ with often tangled header files.

(As to the number of ClearCase admins, I have often seen them run the entire build server farm, automatic tests as well as all the make files – a lot of work if you are building for many platforms. – Once again ClearCase was designed for a different world to what most of us live in these days)
Preston L. Bannister
Seems you left out a dimension - support for GUI-oriented developers. Git and Mercurial were quite immature in this aspect, and are still somewhat lesser.

Since this can be the majority of interaction, the impact on adoption can be painful. Depends on the group of developers, and the difference between resources present and what is needed to push up the learning curve.
Andreas Krey
@Michele: The results of scenario #3 depend on how you change the file in each branch. git looks at the contents of each files (esp. those that appeared or vanished) to detect renames; if they are sufficiently similar, it calls that a rename, otherwise it treats them as an added and a delete file, and performs the merge accordingly.

(And it also only looks at the sum of all changes from the merge root to the tips, meaning that it does not matter whether you added the new and deleted the old file in the same commit or in separate ones.)
ketchup
@Michele

Scenario 2 raises a conflict since the "line" holding the filename was changed to "test_master" on branch master and to "test_nbr" on branch test_nbr. Same line, concurrent changes, so it's a conflict. At least that's how I would explain it. :)

I tested your scenarios 2 and 3 on Cygwin 1.7.1 with git 1.6.6.1. I could run the scenarios on some more OS including Solaris and AIX, but as I understand it, the OS should not have any influence on the way changes and conflicts are detected (except maybe for POSIX file attributes), since only the content of the files are considered.
Michele
It's ok. I will pursue this, since I have to "defend" Git where I work. Because of the content vs file, I also guessed this is the reason of the conflict in scenario 2, but I really don't understand why. Btw, have you tested scenario 3, and if yes, on which OS?
ketchup
@Michele

Sorry if I appeared unfriendly, it was completely unintentionally. I just feel the right place for git support is not exactly the comment section of Lucas' blog (but thanks to Lucas to endure us!).

In short: Git does not track files, it tracks content. Imagine all you files punched into a flat file with a line indicating the original filename, then diff it. This way git can find a given text blob and track the changes to the text across file boundaries. It should explain the conflict in scenario 2.

As for scenario 3, I'm not really sure why it works. I'm not really sure if it should work or not, but it emphasizes my opinion that no technical solution can ever help a team of developers if they don't talk to each other (one way or other): If you start a refactoring, tell it to your mates. :)

But maybe you really should dig into http://git-scm.com for a authorative answer. These guys not only built git, they know git. ;)
Michele
Absolutely no fingerpointing. I really want to understand how this works. After I held a presentation at my work about Git - since many developers here are not happy with our current system (MKS) - a developer challenged me with this problem. MKS just crashes. I tried it several times - both on OSX and on Windows XP (mysysgit) - and always with the same result. I am curious to know exactly what you did. I have asked Scott Chacon directly, because I didn't find a good place to present the problem.

I'll describe the three scenarios I've used. I always start with a clean directory.

#Initialize
git init
touch test
"edit test"
git add test
git commit -m "first commit"
git checkout -b nbr #new branch


#Scenario 1 - rename file in one branch
git mv test test_nbr
"edit test" # Doesn't really make a difference
git commit -am"renamed to test_nbr"
git checkout master
git merge nbr
#Result in master: The file is renamed. No conflict.


#Scenario 2 - rename file in both branches
git mv test test_nbr
git commit -am"renamed to test_nbr"
git checkout master
git mv test test_master
git commit -am"renamed to test_master"
git merge nbr
#Result in master: Conflict. I don't understand why, but at least Git indicates that a file has been renamed.


#Scenario 3 - rename and edit file in both branches
git mv test test_nbr
"edit test_nbr"
git commit -am"renamed to test_nbr and edited"
git checkout master
git mv test test_master
"edit test_master"
git commit -am"renamed to test_master and edited"
git merge nbr
#Result in master: Two files, test_master and test_nbr, and no conflict.
ketchup
@Michele

I just tried to provoke the problem you described and failed: Git either correctly auto-merged the file (when no conflict araised), or flagged a conflict and did not merge. Then again, I'm using git a lot, and this problem never occured to me.

If you feel you experienced a bug, you should:
0. make sure it's a bug, not a feature, just to save you lots of time and trouble
1. write down how to reproduce the behaviour in detail
2. visit http://git-scm.com/ and make sure it's gits fault, either intentionally or not
3. contact the developers, see http://git-scm.com/

Point 0. is very very important, especially when you plan point fingers at git (as your statement seems to imply). ;)

You might be able to find a support shortcut with some heavy git users (e.g. Github), but beware to assume they provide universal git support for free. :)
Michele
A problem with Git?

Renaming and modifying a file in two branches, and merging them results in two files and no merge conflict. This means that Git doesn't give any information that we now have two files instead of one.
Ran
One thing missing in this maturity model is how these models support continuous integration. I have found out that using SVN is good step in practising CI because it has only one common repository and merging takes effort so people tend to move towards small commits.
ketchup
@Paul,
yes, my point: Admins should not merge. Merging code is essentially coding.

The merge-throu-rename-capabiblity of CC made it stand out in its time, as well as its support for massive parallel branches. But I suppose Agile and massive parallel works only with short-lived branches. Maybe the "content tracking" of modern DSCM helps, but, honestly, I don't know.

(The SCM problems are my main, if not only, objections against Agile. Glad to hear it works well with git and hg, have to try that in my next project.)

Finally, I agree dynamic CC probably spent even more network time than traditional working copy models. It is supposed to hide that from user by updating when the system is idle, which is probably why it never bothered me. (Never? No, that's a lie: I remember a project where I went ballistic because I had to wait for a compile for 45min, just because our project site was connected with 64kbit with the CC system. That's pain.)

But I'd like to think Linus was right as he said having a very very fast SCM changes the way we use the SCM. So don't take my comments on CC speed as an glorious aprisal, it's a simple statement that dynamic CC can be setup in a way which does not interfere too much in traditional software development. It can't support the way I'd like to work today anymore.

(Actually, thinking back, the highest risk in using dynamic CC is that nobody can work anymore if the CC server is down, with all the dire business implications.)


@Lucas,

RAD might be IBMs IDE, and CC might be one of IBMs SCMs. However, assuming they work together well, given their different lineages, is … bold. I would never dream they could be the same suite – maybe that's an important point. And please, please don't start on ClearQuest. I'd rather forget about that. :)


I used the CC plugin fo Eclipse some years ago. I can't remember exactly, but it might have had that strange behaviour already. If so, it's probably a weakness in the relevant Eclipse API, which might not support checking in and out folders, so they have to checkin the modified folder immediately (just my guess). That would render an otherwise great CC feature useless.

If it really matters, you can verify the correct behaviour in the CC Excplorer. Then again, it's purely academic, since you need to use the refactoring tools in you IDE, which obviously behaves wrong.

So I must accept the rename behavior is broken, at least for Java developers (which are possibly the majority of CC users nowadays?). I already lamented the missing change sets, so there's nothing to discuss left here. :)

I'm not sure about the branching argument. I like branches, I think CC and git do it right, while CVS and SVN are wrongly pretend there's no branching involved, even if every working copy effectively is a branch. However, I agree that you need short-lived branches for this to work well. Maybe you need to check on the private branches; I have no experience with them.

Finally about config-specs: Ah, I completely forgot about them! Fine stuff, esotheric, perfectly useable to confuse co-workers and drive developers crazy! ;)

In fact, I think config-specs are build- and configuration management responsibility, since they are part of the CC branching scheme. If a developer has to handle them, he does a job he's not paid for, hence he looses productivity. I stand corrected.


P.S.: Sometime you do something wrong, because you don't know better, sometime, because you must. With CC, at some point, it's usually because you must. And I'm not even telling you you're doing something wrong. Except, of course, if you use a 64kbit line and developers have to maintain config-specs, that would be wrong. :)
Lucas Ward
@ketchup

"sorry, my "wording" was not correct: "Biased" might be too harsh,…"

No worries. I'm currently learning French, and fully understand how confusing the small differences in meaning between words can be.

"- CC has atomic check-ins, but it has no change sets. You might be able to emulate change sets with labels, but that's hardly the same. Atomic check-ins just make sure a failed check-in won't ruin the repository (as it was possible in CVS), and does not help code integrity."

This is somewhat of a semantics argument. What I meant by atomic, was that if you checked in 10 files, and have some kind of issue with one, none of them should be checked in. I suppose we could call this atomic changelists or something, but it's beside the point.

"- Renaming a file in CC checks out the parent folder element and leaves in checked out. If your IDE checks it in immediately, your IDE is broken:…"

As far as I've ever known, there has only really been one clearcase IDE, RAD. At least that's where I used it. I know there's plugins for other IDEs, but I would be a bit scared to use them. I agree that the file will still be checked out. However, the file will still be considered renamed regardless. If someone else does an update, and you haven't checked in the files that required renaming as part of the refactoring, which you definitely won't, then the build is broken until you check in the result of the refactoring. Whenever I was working with CC I would always do renames first and check in immediately to avoid this issue. It's not the end of the world, but still annoying.

"-… CC has private branches. And: You will always have to merge if two developers concurrently worked on the same object…"

Personally, that's a problem for me. This ultimately leads into proliferation of branching and away from trunk based development. Although, the benefits of TBD are a completely separate discussion.

"- Calling an admin for more complicated tasks than checking in or out is … a myth. …"

On the previous CC projects I've worked, I always seemed to have some kind of scenario where I had to call in an admin. I suppose it wasn't always simple merges, but something seemed to always get funky, leading to me getting out of the chair and hearing faint mumbling about config specs. :)

"… I find the objections against CC mostly grown out of buggy tools and misinformation… Just one favour? Please, don't tell me the speed of your SCM is lowering your productivity…"

I don't want to make offense, but this is the argument that always comes up when discussing ClearCase that bothers me. It basically boils down to: "You must be using it wrong!" You're right, RAD is buggy, but considering who creates it, I think we can consider it part of ClearCase, it's all the same suite after-all.

I also completely disagree with your productivity statement. Everytime a build is accidentally broken because of the tool (even the IDE you need to work with to use it), productivity is lost. Everytime you have to call an admin over to look at your config spec, productivity is lost. It's not necessarily the speed of the SCM, since if that's a problem, even with clearcase, it's probably network related. Even SVN or Git can have the same issues. I will say though, I have never been on a CC project, even one where day to day usage was relatively zippy, where rebaselining wasn't measured in hours, and don't get me started on ClearQuest :)

Thanks for the well thought out comments. Having people poke holes in arguments is the only way to improve them, and improve your thinking on them.
Paul Hammant
@ketchup. Lots to respond to.

ClearCase admin:dev ratio (IBM has at one time or another recommended 1:20). AFAIK admins are not for merging exclusively, though they are known to get involved. You're right - that's a developer duty in a good team. Admins are for tricky stuff like (but not exclusively) repo repair, and branch creation.

You're point about working copy (SCMs other than dynamic-CC) differ greatly in merge capability. With Git and Mercurial, their merge-through-rename works incredibly well. An Agile team that's doing lots of refactoring need not instigate as much communication around such checkins. For P4/Svn and anything else, much conversation is needed - "hey everyone, I want to move ShoppingCart to an new package, any objections?". Someone with WIP on ShoppingCart is going to ask you to wait until they've checked in. Even Dynamic CC has downstream consequences for merge pain, after Agile style refactorings.

In terms of performance. Lets take the high bar for one operation. Say you want to sync/update from some canonical repo. Say there are actually no changes to come since your last sync/update. Perforce will take one second to tell you "already up to date". Static ClearCase (lets not mention PVCS Dimensions) can take 30 mins. Dynamic ClearCase makes you pay by in build and IDE minutes to make that sync/update look cost-free. Its a lie though. There is no Agile team on earth that is at anything even close to 2/3 effectiveness using Dynamic ClearCase, and no Agile team on earth that does not yearn for instantaneous and omnipresent SCM. I'd suggest that the same effective malaise even effects non-Agile teams, but I personally care less about them :-)
James Sears
Your enthusiasm doesn't mask your inexperience in using some of these tools.
ketchup
@Lucas,

sorry, my "wording" was not correct: "Biased" might be too harsh, and I don't want to be rude on your blog. Please contribute that lapse to my imperfect and non-native English language skill. And yes, to begin with, ClearCase, as any CSCM, is no longer cutting edge. I totally agree. But let me correct a thing or two.

- CC has atomic check-ins, but it has no change sets. You might be able to emulate change sets with labels, but that's hardly the same. Atomic check-ins just make sure a failed check-in won't ruin the repository (as it was possible in CVS), and does not help code integrity.

- Renaming a file in CC checks out the parent folder element and leaves in checked out. If your IDE checks it in immediately, your IDE is broken: Usually you have to check in the folder element yourself. (Actually, that's the coolest feature CC has, if you employ it correctly.)

- Two developers can never work on the same object without branching (except, maybe, in Etherpad or Google Wave). It's just that most SCM have an implicite branch (working area). CC has private branches. And: You will always have to merge if two developers concurrently worked on the same object.

- Calling an admin for more complicated tasks than checking in or out is … a myth. What's more complicated than check-in/check-out? Merge?!? A developer is far better eqipped to solve merge conflicts than a CC admin ever could (which might be the guy who knows how to handle an Oracle, but not your code). You just need to read the manual. I know, that's hardly done nowadays, but then again CC is a dinosaur. Be brave! Hit F1 once in a while!

So, while I'm still with you in any practical term, I find the objections against CC mostly grown out of buggy tools and misinformation (for lack of a better word - please refer to the first paragraph). Just one favour? Please, don't tell me the speed of your SCM is lowering your productivity, lest I'm compelled to ask how you measure productivity. ;)

Anyway, I should be glad as long as git is coming out of that race of yours first place. You know, in git, the developers are the SCM admins, so they all must learn how SCM works. Otherwise they simply can't work.
Lucas Ward
@Ketchup

'But is looks a bit "biased" against dynamic ClearCase: This can be quite responsive, even for large installations. It can be done! :)'

Since it's centralized, having good infrastructure can help with some of the pain, but it will never be as fast as something on a local hard drive. Furthermore, it's not atomic, which leads to all kinds of other errors. If you rename a file in your IDE, that rename is immediately committed, even if the change in files referencing this class aren't, which leads to a broken build. Check-ins aren't atomic either, so if there is an issue with one file, for whatever reason, you just broke the build. There's also the file locking. Two developers can't work on the same file at the same time without branching and a huge messy merge after the fact. There's also the slowness you'll encounter when you need to do anything more trivial than simple check-in and check-outs, which is why the admin to developer ratio is so high. Even if your network is responsive, your productivity will be lower.

So, I wouldn't say that the article was biased against ClearCase dynamic mode, but rather the things we thought were important in an SCM naturally push ClearCase dynamic lower, although, I'm not sure what features others would find more important that would push dynamic mode up. I would be interested in hearing them though.
Cosmin Stejerean
Preventing developers from leaving the building with source code once it is checked out to their machines, or preventing them from accessing source code another developer checked out are issues that affect all source control systems equally.

Regarding local checkouts, the only difference I see between a source control system like Git and something like Subversion is that by default Git gives you a copy of the entire history locally, as opposed to only the latest revision. Let's ignore for a moment the fact that it is possible to get a copy of every revision from any version control system. What about those historical revisions makes stealing a Git repository more concerning than stealing the latest revision from any other repository?

Any enterprise concerned about people stealing source code or any other IP can take the same kind of measures to prevent theft, from severely restricting network communications to superglueing USB ports. That's why I mentioned the issue of stealing source code is a red herring. It has nothing to do with Git.

Leaving aside the issue of stolen checkouts, Git allows one to control read access to the canonical repository in a similar way to any other tool. It is certainly possible to restrict read access to release branches, although the easiest way of doing so is to separate release branches with sensitive information into a second repository that only select people can access.
Paul Hammant
Git on Windows ..

I'm interested by AndLinux.org. As soon as I can make some space on my MBP, I'm going to do Win7 + AndLinux.org.

It might be more of a first class place for Gitters on Windows.

on 'Svn not making it easy'



In Summary.

I've thought some more and I think the summary of the permissions issue for Agile teams is that while all folks in a dev team should see all of trunk, some enterprises want to make release branches read restricted because they're putting settings into bash scripts or properties files (or alike).

Going back to trunk, I've only seen once in thirteen years of using SCM a company put read restrictions on items in trunk. That company had co-mingled multiple separate projects in one big source tree (kinda like this )

If you were using Git (or Mercurial) in such a strict enterprise, most likely you'd be ruling out the clone/pull from each other aspect of Git. Of course that would be a rule, as anyone could try to mount a share or open up SSH on their dev machines to subvert rules. That's true for Svn, CVS, Static-Clearcase (and more) too.
ketchup
Cosimin,

with hooks you can implement write policies, not read policies. That's basically impossible in a DSCM, since all you need is find someone who has read access, and clone his repo.

Red-herring or not (I disagree with you: I've seen co-workers sending out sensible data because they where not aware of the problem, and I've seen co-workers sending out code because they where unaware of any problem), this is an issue Enterprise has. Try it: Suggest to your project leader you want to copy all the source of your project onto a stick and walk it out the door, destiny unknown. It's a great exit strategy. From your project.

Second, cygwin git does not work as expected, since cloning is broken. Google for cygwin 1.7.1 git. As soon as cygwin has an issue tracker (besides subscribing to an email list), I might consider cygwin as a viable alternative to a Unix environment. (Hint: If you have a cygwin prior to 1.7.1, and you need itjust for git, don't upgrade until someone blurrs out it's working again. Trouble is to find out when it's fixed.)

@Paul,
yes, with the git-svn "bridge" you can suck an SVN dry. But svn does not make that especially easy for you either, nor is this the "fault" of svn. Plus your admin might implement policies here, and the same would hold for any centralized SCM you build a git bridge for.


Don't get me wrong: just because I know these arguments against git/hg does not mean I share the objections - at least not completely (face it: developers are as fallible as any human). I really don't want to defend the Enterprise position, especially not against git, which is, I don't know how often I have stated this, my favourite SCM ever (and I have worked with quite a few SCMs). It's just that DSCM inherently empowers developers, while Enterprise (currently) wants to control developers, which results in a fundamental incompatibility that cannot be talked away.
Paul Hammant
Regarding 'red herring' or not, let's expand that a little ..

With Svn (and Git-Svn proves it) you can pull out all commits ever made to a repo (trunk + all branches) and end up with a zip that fits comfortably on a flash drive. Thus Svn also fails the "a competitor could receive all code ever" test.

That said, paths can be hidden behind Apache DAV permissions. Perforce can similarly have restricted paths/branches. ClearCase breathes this sort of restraint.

Thus we need to work out configurations whereby the canonical git repo has variable restrictions per cloner.
Cosmin Stejerean
You might not be able to enforce hooks that run on the developer's workstation, and I don't think one should ever try to enforce hooks at that level.

You can however enforce the hooks that run on the canonical source repository, which is what really matters for enforcing controlled read/write access. You can prevent people from pushing changes to a certain branch, from pushing changes that touch a certain file, from pushing changes with commit messages that are not to your specification, etc.

If you are curious about how specific workflows you have encountered can be implemented, I will gladly attempt to give you examples of how I would implement them in Git.
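
To make that concrete, here is a minimal sketch of the kind of check a server-side update hook could perform. Git runs hooks/update on the canonical repository with three arguments (the ref being updated, its old object id, and its proposed new object id) and rejects the push of that ref if the hook exits non-zero. Hooks are normally small scripts, but to stay in the same language as the rest of this blog, the policy logic is shown as a Java class that a one-line wrapper script could delegate to; the branch name and the rule itself are invented purely for illustration.

// Illustrative policy for a server-side "update" hook.
// Git calls the hook as: update <refName> <oldObjectId> <newObjectId>
// and rejects the push of that ref if the process exits non-zero.
public class UpdateHookPolicy {

    public static void main(String[] args) {
        String refName = args[0];
        String oldRev = args[1]; // previous tip of the ref (unused in this sketch)
        String newRev = args[2]; // proposed new tip of the ref (unused in this sketch)

        // Invented example rule: nobody pushes directly to the release branch;
        // it is only updated by a controlled release process.
        if (refName.equals("refs/heads/release-1.0")) {
            System.err.println("Direct pushes to " + refName + " are not allowed.");
            System.exit(1);
        }

        // Accept everything else.
        System.exit(0);
    }
}

Note that this only controls what reaches the canonical repository; it does nothing about what a developer can do with a clone they already have read access to, which is exactly the read-policy limitation ketchup describes above.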

The whole argument about non-loyal developers stealing source code is a red herring, as it has nothing to do with Git or whatever other source control system is being used.

Regarding Windows support, it is true that traditionally Git has been a second class citizen on Windows, requiring cygwin to run. That said, support for Git on Windows has become a lot better recently.

Git under cygwin should run just fine, but if that does not work for you, or you do not want cygwin, I recommend trying msysGit. A good starting point for Git on Windows is http://github.com/guides/using-git-and-github-for-the-windows-for-newbies
ketchup
Cosmin,

hooks are nice. The problem is that they are not mandatory. You cannot force developers to use them, so you cannot enforce anything with them. Hence, you cannot implement an Enterprise workflow (let alone certify it), because a workflow you cannot enforce is … not a workflow.

If you are the Enterprise, you don't assume loyal developers. They are human resources, as loyal as you pay them if you're lucky (but Enterprise is not about being lucky). Hence they might work for your competition tomorrow. You don't want to give your competition (or whoever might be after your business-vital information) a head start by handing over all your code. Ah, sorry, here's the right button: all your code, ever.

In other words: Enterprise is driven by accounting and legal, not by geeks.

Next, the Enterprise loves Windows as a workstation platform. There's no Windows git available (don't tell me there's cygwin unless cygwin git can clone a repository on the first try again).

You see, it's not only git that has a hard time in Enterprise. It's primarily that DSCM is not yet trusted, and this distrust is not entirely unreasonable.
juancn
I use a ClearCase derivative at work (which added a form of pseudo-atomic commits). It's fugly, slow, and extremely unreliable!

We actually built a bridge using git as a frontend to make it palatable (with a lot of script-fu to keep histories in sync).

With network latencies > 200ms, the unnamed-CC derivative is completely unusable (I'm in Argentina and the data center is in the US).

Opening a large project in IntelliJ used to take between 2 and 4 hours! Now it takes a couple of minutes at worst.
Paul Hammant
Mike,

The Svn, P4, Git/Mercurial rankings above are ultimately about merging, as you surmise.

Subversion is weaker than P4 presently because its merge point tracking cannot work through a rename. Indeed for Subversion 1.6.3 through 1.6.5 the merge would abort in particular situations because of conflicts concerning renames. 1.6.6 sailed past the 'abort', but left us with conflicts that we'd need to resolve later. That conflict would not have existed in P4 with the appropriate branch-spec (see below). It would not have existed in Git/Mercurial at all.

Perforce can do merge through rename if you set up branch mappings - though it's pretty second class. It can't be tooled because branch specs are committed in themselves, separately from change lists. They transcend versions. If they were committed as part of a change list, then the likes of IntelliJ, Eclipse and 'Studio could update the specs when refactorings happen.

While I agree that trunk is best, merge is also used for updating the working copy ('tis often forgotten). To that end, Subversion leaves you abruptly, Perforce can't help because a branch spec is about branches, not working copies, and only Git and Mercurial update your working copy with renames/moves regardless of whether you have local commits that have not been pushed to the backend.

Now there is nothing here that the Subversion teams (or Perforce the company) cannot fix.

Meanwhile, like you, I recommend true trunk-based development as teams mature through experience with SCM tools.
Mike Mason
Could you articulate why Subversion doesn't make it into the next level? By my reckoning it hits most of the criteria.

Whilst the best developers in the world undoubtedly will make use of the new features in advanced tools like Git and Mercurial, those features are likely to be dangerous to the average corporate developer.

Right now we are constantly facing teams who have some crazy branching "strategy" in place, usually to hide the fact that they have no confidence in running their projects and want to hedge their bets with a crazy merge tree. Surely a DVCS, where such merge swapping is even easier, would make this problem worse and make people even less likely to actually fix their schedules and engineering practices.
Cosmin Stejerean
There is nothing about Git that prevents controlled read/write access or implementation of enterprise workflows.

Between the distributed nature of Git and the available hooks one can implement even the most complicated enterprise workflows.

Perhaps we need more commercial support options for Git in order to allow companies to feel comfortable.
ketchup
Nice job.

But it looks a bit "biased" against dynamic ClearCase: it can be quite responsive, even for large installations. It can be done! :)

Further, "branching and tagging expensive" is a bit broad, since branching and tagging are quite different use cases: having expensive branches might not be as big a problem as having expensive tags. Depends on your development model, of course.

Then again, I agree that Git and probably Mercurial (I don't use hg, so I can't really tell) clearly mark the top of the line, if only technology is considered, and I clearly prefer git over any other SCM I have worked with, including SVN, ClearCase, Synergy and, of course, CVS/RCS.

But Enterprise decisions are made at least partially in the legal department, which calls, as a minimum, for controlled (read) access and enforced, certified workflows. The nature of DSCM works against that, and evolution will take time. So I guess we'll see Subversion around for quite a while.

Spring Batch and DDD

I use Google blog search (http://blogsearch.google.com/) to keep track of anyone blogging about Spring Batch. Usually, blog posts are 'getting started guides' with simple examples, which is understandable given the way our samples are organized. Our samples are intended to test nearly every framework feature in a real batch job, and serve as a form of functional test. This is good for developers who understand the framework and want to know how to use feature X, but not so good if you're new to the framework. I always intended to create a simple example, but the blogging community has been so good about creating them that I'm not sure we could add much value.

I’m slightly digressing though. I recently ran into a much different blog post about Spring Batch:

http://java-chimaera.blogspot.com/2008/11/spring-batch-domain-driven-design-in.html

The main theme of the post is that Spring Batch serves as almost a reference implementation for DDD:

“By applying DDD in Spring Batch, we now have a realistic and elaborate example of concepts like ‘repository’, ‘factory’ or again the ‘ubiquitous language’. What’s also interesting in this implementation is that, by being a technical domain, developers can easily comprehend the concepts and identify them more easily.”

In nearly every presentation Dave or I have given about Spring Batch, we start by describing the domain model and explaining that Eric Evans' book was highly influential in how we designed Spring Batch. I can't begin to tell you how much this model changed from when we first started until 1.0. In the original model, Steps were called StepControllers, and so on. As we continued through development, the names naturally began to evolve. We were very conscious of Evans' concept of Ubiquitous Language, and as we noticed ourselves continuing to refer to a concept by a particular name, we usually decided that the class should be called that, which ultimately led to the simple Job and Step in the framework today.
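
To make the point about the ubiquitous language a little more concrete, here is a purely illustrative Java sketch; the interface and method names below are invented for this post and are not the actual Spring Batch API. The idea it shows is simply that a DDD-style repository speaks the language of the domain (jobs and their runs) rather than the language of the storage mechanism (tables and rows).

// Purely illustrative -- not the real Spring Batch API.
// A DDD "repository" mediates between the domain model and persistence,
// and its vocabulary is the ubiquitous language of the domain.
public interface SampleJobRepository {

    /** Record that a new run of the named job has started. */
    JobRun startRun(String jobName);

    /** Persist the latest state of a run as its steps complete. */
    void update(JobRun run);

    /** Find the most recent run of the named job, or null if it has never run. */
    JobRun findLastRun(String jobName);

    /** Minimal domain object for this sketch: a single run of a job. */
    class JobRun {
        public final String jobName;
        public boolean completed;

        public JobRun(String jobName) {
            this.jobName = jobName;
        }
    }
}

The real framework's names differ, but the shape is the same: callers deal in domain objects and domain verbs, and the persistence details stay behind the interface, which is exactly the property the quoted post picks up on.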

I could talk for days about the evolution of the domain model in Spring Batch, but in the interest of keeping this post reasonably sized, I’ll simply summarize by saying that DDD was a big driver in the creation of Spring Batch, and it’s interesting to me that others unrelated to the project can see this by simply inspecting the code. I think it speaks volumes for the approach.