Small Stuff That Matters – Ubuntu File Associations

Often there is some little thing you would like to figure out, and because it is little you tend to put it off due to your busy schedule.

An example of such a little thing is figuring out how to associate an application with a file type on Ubuntu. In my case it is PDF files. By default “Document Viewer” is used to display the content of a PDF file and it works, but I prefer to use Acrobat Reader. Of course the solution is already out there if you search for it, but what amazes me is how happy people are when they find that solution, and I know this because of their thank-you notes.

Following the DRY principle, here is the link to the solution that I found.

Posted in Linux

Hiring & Acceptance Test

I came across this posting on a Yahoo! group recently and it was too good not to pass along. It is about writing an acceptance test for hiring a senior developer in the spirit of Extreme Programming.

class SeniorDeveloperAcceptanceTest extends TestCase {
   Developer candidate;
   Collection team;

   public void setUp() {
      candidate = new Developer();
      team = YourCompany.getTeam();
   }

   public void testTechnicalSkills() {
   }

   public void testTeachingSkills() {
   }

   public void testHumanBehavior() {
   }

   public void testMethodologySkills() {
   }
}

Posted in Career, Software Development

Java Concurrency Synchronizers

The Java Concurrency Utilities provide a number of powerful, high-performance threading utilities. At a high level they can be grouped into the four categories below, and this article covers the synchronizers in the third one.

  1. Thread Pools and Task Scheduling
  2. Concurrent Collections
  3. Locks and Synchronizers
  4. Atomic Variables

Java has supported synchronization since day one through the synchronized keyword, but that mechanism works at the method or block level and lets only a single thread in at a time. The concurrency utilities introduced a number of new mechanisms. Among them are the semaphore, barrier, latch and exchanger.

A semaphore is used to control or limit the number of activities that can access a certain resource or perform a given action at the same time. An easy way to understand and remember what a semaphore does is to associate it with permits. A semaphore maintains a set of permits, and a thread must acquire a permit from the semaphore before it can obtain a resource or perform a certain activity. The permit is returned to the semaphore when the thread is done accessing the resource or performing the activity. If all the permits have already been given out, the next thread that asks for a permit blocks until one is returned.
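
The permit idea can be sketched with java.util.concurrent.Semaphore. This is only a minimal illustration; the class name SemaphoreSketch, the method observedPeak and the numbers are made up for the example. It starts more workers than there are permits and records the largest number of workers that ever held a permit at once.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class SemaphoreSketch {
    // Starts `threads` workers that each need a permit and returns the highest
    // number of workers seen holding a permit at the same time.
    static int observedPeak(int permitCount, int threads) throws InterruptedException {
        Semaphore permits = new Semaphore(permitCount);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    permits.acquire();              // blocks while no permit is free
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = active.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(10);               // pretend to use the resource
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    permits.release();              // hand the permit back
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak holders: " + observedPeak(3, 10));
    }
}
```

With 3 permits and 10 workers, the reported peak should never exceed 3, because the other workers are parked inside acquire() until a permit comes back.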

A latch is used to allow one or more threads to wait for a set of threads to complete an action. Once a latch reaches its open state, it never changes back. A common use case is to start several threads and have them wait until a signal is received from a coordinating thread. Another example: in a multiplayer game you don’t want the game to start until all the players have joined.
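
The multiplayer example might look roughly like this with java.util.concurrent.CountDownLatch. The names LatchSketch and startGame are hypothetical, and each "player" is just a thread that announces it has joined.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LatchSketch {
    // Waits until every player has joined, then returns how many had joined
    // at the moment the game was allowed to start.
    static int startGame(int players) throws InterruptedException {
        CountDownLatch allJoined = new CountDownLatch(players);
        AtomicInteger joined = new AtomicInteger();

        for (int i = 0; i < players; i++) {
            new Thread(() -> {
                joined.incrementAndGet();
                allJoined.countDown();   // this player is ready
            }).start();
        }

        allJoined.await();               // blocks until the count reaches zero
        return joined.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("players at start: " + startGame(4));
    }
}
```

Because each player increments the counter before calling countDown(), the coordinating thread sees all of them once await() returns.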

CyclicBarrier is used to create a barrier, and it can be constructed in two ways. The first takes just a number of threads; the second takes a number of threads and a barrier action. The barrier action is a Runnable task that runs once all the threads have arrived at the barrier. Basically a barrier is used to stop a set of threads from proceeding until they have all reached a specified point. Compared to a latch, which is used to let threads run wild, a barrier is used to stop a set of threads.
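
Here is a small sketch of the second form, a party count plus a barrier action, using java.util.concurrent.CyclicBarrier. BarrierSketch and runParties are illustrative names only; the barrier action simply records how many threads had arrived when it ran.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

class BarrierSketch {
    // Returns how many parties had arrived when the barrier action ran.
    static int runParties(int parties) throws InterruptedException {
        AtomicInteger arrived = new AtomicInteger();
        AtomicInteger seenByAction = new AtomicInteger(-1);

        // Constructor with a barrier action: it runs once, after the last arrival.
        CyclicBarrier barrier =
            new CyclicBarrier(parties, () -> seenByAction.set(arrived.get()));

        Thread[] workers = new Thread[parties];
        for (int i = 0; i < parties; i++) {
            workers[i] = new Thread(() -> {
                arrived.incrementAndGet();
                try {
                    barrier.await();     // wait here until every party has arrived
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return seenByAction.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("arrived when action ran: " + runParties(3));
    }
}
```

Dropping the second constructor argument gives the first form, where the threads are released without any action being run.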

Of the four kinds of synchronizers, the exchanger is the unique one. It allows two threads to exchange data in a thread-safe manner. Consider the producer and consumer problem: an exchanger lets the producer and consumer exchange a whole buffer of tasks in one shot, instead of the consumer picking one task at a time out of the task queue.

The next article will cover Atomic Variables.
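
As a rough illustration of the buffer hand-off, the sketch below uses java.util.concurrent.Exchanger with one producer thread and the calling thread acting as the consumer. ExchangerSketch, handOff and the task strings are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Exchanger;

class ExchangerSketch {
    // The producer fills a buffer and swaps it for the consumer's empty one.
    // Returns the buffer the consumer ends up holding.
    static List<String> handOff(List<String> tasks) throws InterruptedException {
        Exchanger<List<String>> exchanger = new Exchanger<>();

        Thread producer = new Thread(() -> {
            List<String> full = new ArrayList<>(tasks);
            try {
                exchanger.exchange(full);   // give away the full buffer in one shot
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer side: offer an empty buffer, receive the full one.
        List<String> received = exchanger.exchange(new ArrayList<>());
        producer.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> tasks = new ArrayList<>();
        tasks.add("parse");
        tasks.add("index");
        System.out.println(handOff(tasks));
    }
}
```

Each call to exchange() blocks until the other thread has also called it, so the two buffers change hands at a single meeting point.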

Posted in Java

My First Writable With Hadoop

Hadoop uses a simple and efficient serialization protocol to serialize data between the map and reduce steps. There is a lot going on between these two steps, but this article is not about that. Rather it focuses on what a developer needs to know in order to write a custom Writable class. Just for the folks that are new to MapReduce in Hadoop: the OutputCollector in the Map and Reduce steps only accepts values whose type is a Writable.

Writable is an interface in Hadoop and it has two methods: void readFields(DataInput in) and void write(DataOutput out). If you browse the Hadoop Javadocs, there are roughly 43 classes in Hadoop that implement the Writable interface.

Depending on what your MapReduce application needs, one of the out-of-the-box Writables may do the job, but if there isn’t one, it is fairly straightforward to write a custom Writable. That’s what I had to do for my project. What I discovered, and there isn’t much documentation on it, is that in addition to the two methods defined in the Writable interface, you also need to implement the toString() method if you want the data in your custom Writable to appear in the output file (this took me some time to figure out).

One of the interesting Writables is GenericWritable. It comes in handy when the Map and Combiner output the same key type but different value types. Hadoop requires that the values passed on to reduce all have a single value type. GenericWritable basically helps you wrap instances of different value types. See the GenericWritable Javadoc for more details.
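
To make this concrete, here is a minimal sketch of a custom Writable with the two interface methods plus toString(). So that it runs without Hadoop on the classpath, it declares a local stand-in interface with the same two method signatures; in a real job you would implement org.apache.hadoop.io.Writable instead. The PageHits class, its fields and the roundTrip helper are hypothetical and are not taken from my project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable,
// declared here only so the sketch compiles without Hadoop.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type: a page URL and its hit count.
class PageHits implements Writable {
    private String url = "";
    private long hits;

    PageHits() { }                        // keep a no-arg constructor for deserialization

    PageHits(String url, long hits) {
        this.url = url;
        this.hits = hits;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();               // read fields in the order they were written
        hits = in.readLong();
    }

    @Override
    public String toString() {
        return url + "\t" + hits;         // the text form that ends up in the output file
    }

    // Serializes a value to bytes and reads it back into a fresh instance.
    static PageHits roundTrip(PageHits original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        PageHits copy = new PageHits();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}
```

The roundTrip helper is a cheap way to check that write() and readFields() agree on field order before the class ever goes near a cluster.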

Posted in Lucene Hadoop, Uncategorized

Setting Up Hadoop On Windows

I recently had to learn Hadoop, and getting Hadoop running on Windows Vista was not as straightforward as I thought. I believe getting Hadoop up and running on Linux is much easier. I would like to use this blog to share my experiences, and hopefully it will help the next person who is trying to do the same thing.

My goal was to be able to:
1) Run some of the examples that came with Hadoop
2) Run a MapReduce Java program in Eclipse and be able to debug it

First Problem:

Hadoop comes with a set of shell scripts, so the first thing to do is download and install Cygwin. Since the scripts were written on some Unix variant, they did not work out of the box. When I tried the ‘bin/hadoop’ command, I got the following:

./bin/hadoop: line 18: $'\r': command not found
: No such file or directory./bin
./bin/hadoop: line 21: $'\r': command not found
: No such file or directorydrive/c/tools/hadoop-0.14.4

Apparently this is a common problem, and it is related to the newline differences between Windows and Unix. Windows uses two characters (\r\n) and Unix uses one character (\n). Here is a link to a solution. Basically you need to run the ‘dos2unix’ command on Hadoop’s scripts, or use your favorite Unix command to strip out the ‘\r’, e.g. sed -i 's/\r$//' <file name>

Once this problem was resolved, I was able to run the examples that came with Hadoop, like WordCount or Grep, inside the Cygwin shell.
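
If neither tool is at hand, the same conversion is easy to sketch in Java. This is just an illustration of what stripping the ‘\r’ amounts to, not something Hadoop provides; the class and method names are made up. The file is decoded as ISO-8859-1 so that every other byte passes through unchanged.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LineEndings {
    // Turns Windows line endings (\r\n) into Unix line endings (\n).
    static String toUnix(String text) {
        return text.replace("\r\n", "\n");
    }

    // Rewrites a file in place with Unix line endings.
    static void convertFile(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        // ISO-8859-1 maps each byte to one char, so the round trip is lossless.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Files.write(file, toUnix(text).getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            convertFile(Paths.get(name));   // e.g. the scripts under bin/
        }
    }
}
```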

Second Problem:

My second goal was to run one of Hadoop’s examples inside Eclipse. When I tried to do this, I got an exception while Hadoop was trying to create a process (‘CreateProcess error=2’), and the command was something like ‘df -k’.

It was frustrating because I was able to run the examples in the Cygwin shell. It turned out the MapReduce framework was trying to execute the command ‘df -k’, which Windows could not find outside Cygwin. Once I added the Cygwin path to the Vista PATH environment variable, the problem went away. It was great!! Now I can actually step through the code. As a developer, this is very important.
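
A quick way to confirm the fix from inside Eclipse is to check whether ‘df’ can be found on the PATH that the JVM actually sees. The sketch below is only a diagnostic helper under my own assumptions; PathLookup and find are hypothetical names, and it simply looks for a file called df or df.exe in each PATH directory.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

class PathLookup {
    // Looks for `name` or `name.exe` in each directory of a PATH-style string.
    static Optional<Path> find(String name, String pathValue) {
        if (pathValue == null) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] { name, name + ".exe" }) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException e) {
                    // skip PATH entries that are not valid paths on this platform
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Prints Optional.empty when df is not visible to this JVM.
        System.out.println(find("df", System.getenv("PATH")));
    }
}
```

Running it as a Java application in the same Eclipse workspace shows the PATH as Eclipse inherited it, which may differ from the one in the Cygwin shell.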

I am looking forward to sharing my Hadoop experience as I learn more about it.

Hadoop is a very powerful piece of technology and often power comes with complexity.

Posted in Distributed Programming

Nutch & Lucene

I just discovered a good video, “Experiences with the Nutch search engine”, and the presenter is the project founder himself, Doug Cutting.

He gave a great history of Lucene and then covered Nutch, Hadoop and their future. This is a great way to get an overview of these 3 very interesting and popular technologies.

Experiences with the Nutch search engine
– Doug Cutting

Posted in Lucene Hadoop

Lessons Learned From Recent Project

Reflection is an important process in the journey of learning. This post tries to capture the lessons learned from leading a project. The project was not a large one and not extremely complicated, but it encompassed all the typical elements of a software project, for example, project planning, design, implementation, testing, task assignment, collaborating with other teams like QA, public perception, etc.

The lessons are categorized by the various aspects of a software project.

Conceptual Phase

This phase is all about thinking, analyzing and validating. It requires exercising a large number of neurons in your brain. Strong analytical skills and the ability to think on your feet are useful tools for being effective.

Like most other tasks, translating conceptual ideas into something more concrete requires some trial-and-error thinking, exploring options, validating new ideas and being open to other perspectives. One must be open to new ideas and not rush to judgment. Use cases are your guiding lights as well as your best friends.

This is probably the most challenging phase. Coming out of it may reveal that one needs to be humble and accept the fact that there are many more things that one doesn’t know.

Technology Selection

Working with new technologies is always fun and challenging at the same time. The important thing is to make sure they can satisfy your requirements. Resist the urge to use a piece of new technology just for the sake of fun or seeing it as another bullet point on your resume.

Technologies are supposed to help you become more productive and sleep better at night. Therefore using familiar or mature technologies is a safer bet.

Building Phase and Pounding the Keyboard

This is all about following, with a grain of salt, the blueprint that was laid out in the previous steps. When the rubber meets the road, it sometimes requires revisiting the initial thinking and coming up with an alternative solution.

The other important elements are avoiding shortcuts and being consistent. Shortcuts will come back to haunt you during the QA cycle or, worse, after the code is in production. Typically an application has a set of similar structures and functionalities. It is important to be consistent in the implementation of these similar parts.

Consider unit testing your friend. Test cases provide confidence when refactoring or enhancing the code. In my experience, writing test cases takes at least 40% of the overall development effort, so don’t underestimate this task.

Centralize error handling and other useful facilities that are needed across different parts of the application. This will help avoid committing the “duplicate code” sin.

Make it a habit to do code reviews. This is a great chance for sharing knowledge and catching bugs before they become real bugs. In addition, it will help maintain consistency, though it will not necessarily eliminate every inconsistency.

Teamwork and Communication

Small teams have better success in accomplishing something concrete in a reasonable amount of time. Nevertheless, make it a priority to communicate, via meetings, informal conversation in the hallways, or in someone’s cube.

This will also help catch wrong assumptions in a timely manner, leaving sufficient time to correct them. In addition, it will help build working relationships and make it more fun to work with your teammates.

Document the technical decisions that were made during the course of the development phase. Any time something requires more than one person to make a technical decision, it is important to document the details of the decision, such as what the agreement was, the reasoning, and possibly the other options that were discussed. This is mainly because we have limited memory and the number of technical decisions is not small. The obvious benefits of doing this are that you don’t have to clutter your memory and you can easily defend your decision when someone asks about it 6 months after it was made.


Being disciplined about following good programming practices throughout the development phase is an important aspect. Among them are code reviews and writing unit test cases.

Often during the initial development, some shortcuts are taken or values are hard-coded. Make sure to remove them as soon as possible. Otherwise they will create surprises and may cause you to miss the deadline. In our project, we delayed integrating with the corporate single sign-on until it was absolutely necessary. When we finally took the plunge, it required more time than expected. We had high confidence that it would work, so it was not a huge risk that we took.


A project schedule provides a road map to march toward. The challenging part is that once the schedule is public, it is almost a mandate to accomplish your work according to that schedule. The key is possibly to have private and public dates, preferably with the latter coming after the former.

A couple of days before the committed public date arrives, it is important to evaluate where you are. If you think you are running late, it is important to communicate this with your manager and possibly your customer.

Sometimes you need to look at what the committed features are and possibly declare code complete without the minor features that your customers don’t care too much about or that can be delayed to a later release. By doing this, you will be able to make your customer happy and at the same time not lose the trust that others have in you.


Keep your pace by breaking down your development cycle into a number of smaller cycles. This allows you to sprint and then take small breaks to re-energize. The problem with a long development cycle is that it creates a great opportunity to procrastinate, and eventually you will be caught by surprise. Procrastination is a part of human nature, and it is better to create an environment that discourages it.

Posted in Software Development