Tuesday, December 7, 2010

HashMap of a HashMap of a HashMap Problem -- Part 4 - A Way To Correct The Problem

Previously we have seen how collections can be used to circumvent the type system.  They can be thought of creating classes without actually creating classes.  This can be useful when you have simple relationships you want to store, but confusing as you add more complicated relationships.  They lead to maintainability nightmares where even the original author does not understand what is occurring.  The code is mucked up with nested data structures both in the definition of the relationships and the use of them.  Now we’ll see how to correct the problem.
The way to correct the problem is not revolutionary.  In fact, it is not something that I used to think I had to explain.  It was only after I saw multiple developers create the HashMap of a HashMap of a HashMap problem and explained what they should have done that I realized not everyone saw it as intuitive.  The solution is simply applying good object oriented principles.  Let us return to our Person class we created in Part 2 (for now we will keep the birthday out of the relationship):
public class Person
{
    public string Name { get; }
    public List<string> Nicknames { get; }
}

In Part 3 we added complexity that the name was actually just a first name and that we wanted to tie last name to the first name and the list of nicknames. Let us add this information to our Person class:

public class Person
{
    public string FirstName { get; }
    public string LastName { get; }
    public List<string> Nicknames { get; }
}

We have now defined a class that encapsulated the information contained in our LastNameToFirstNameToNicknameMap in Part 3 into an object.  If we have a list of Persons we have the same information as we did in our complex nested properties.  However, we’ve lost some of the ability to quickly retrieve values based upon name values.  We have to look through the list of persons for a match.  There are a couple ways we can approach the problem at this point, depending upon how we want to expose and deal with our data.  The following is just one way that this can be handled.  Let’s call a collection of people who share a last name a Family.  A family class would on a basic level look like this:

public class Family
{
    public string LastName { get; }
    public List<Person> People { get; }
}

 I’m purposefully omitting the constructor and methods needed for management of the family for simplicity.  Let us also think about the other layer of data we wanted to have which was nationalities and add that data to our objects as well.  Our Family and Person class would look as follows:

public class Person
{
    public string FirstName { get; }
    public string LastName { get; }
    public List<string> Nicknames { get; }
    public string Nationality { get; }
}

public class Family
{
    public string Nationality { get; }
    public string LastName { get; }
    public List<Person> People { get; }
}

You’ll notice that Nationality and Nickname is duplicated between the two objects.  This is purposeful on my part.  This could be handled in other ways, like as a Person could have a reference to his family.  This makes sense in our contrived example, but in practice, I find the way this is structured to be more practical.  In reality, I often want to create the type of object a Person is our example of without knowing what family it will belong to at creation time.  However, I know the properties like LastName and Nationality at creation time.  What I mean is that I want to construct families from people not create families and then add people to it.  It is a subtle difference, but is the general case where I’ve seen the HashMap of a HashMap of a HashMap problem occur.  You of course should write the data structure that is most appropriate for your situation.
Ultimately, I need some sort of top-level object that will provide be access to the values.  I could just use a list of Families and have some outside object manage the adding and removal of items, but that breaks the object-oriented approach we are going for.  What we really want to do is encapsulate the management logic into a class and provide convenience methods for access and manipulation.  Let’s call our top level object Families.  I am going to construct a class with some basic features for creating our constructs based upon a person.  Let me preface this by saying it is meant to illustrate the structure, not be a fully functioning example:
public class Families
{

    /// <summary>
    /// Private class to encapsulate the logic connecting a last
    /// name to a family
    /// </summary>
    private class LastNameFamilyMap{
        private Dictionary<string, Family> _familyMap;

        public LastNameFamilyMap()
        {
            _familyMap = new Dictionary<string, Family>();
        }

        public void AddFamily(Family family)
        {
            _familyMap.Add(family.LastName, family);

        }

        public void RemoveFamily(Family family)
        {
            _familyMap.Remove(family.LastName);
        }

        public Family GetFamily(string lastName)
        {
            Family family;
            if (_familyMap.TryGetValue(lastName, out family))
            {
                return family;
            }
            return null;
        }
    }

    /// <summary>
    /// Private class used to encapsulate the nationality
    /// to families relatonship.
    /// </summary>
    private class NationalityFamilyMap
    {
        private Dictionary<string, List<Family>> _nationalitiesToFamilies;
        private Dictionary<string, LastNameFamilyMap> _familyMappings;

        public NationalityFamilyMap()
        {
            _nationalitiesToFamilies = new Dictionary<string, List<Family>>();
        }

        public void AddFamily(Family family)
        {
            List<Family> families;
            if(!_nationalitiesToFamilies.TryGetValue(family.Nationality, out families)){
                families = new List<Family>();
                _nationalitiesToFamilies.Add(family.Nationality, families);
            }
            families.Add(family);

            LastNameFamilyMap lastNameMap;
            if (!_familyMappings.TryGetValue(family.Nationality, out lastNameMap))
            {
                lastNameMap = new LastNameFamilyMap();
                _familyMappings.Add(family.Nationality, lastNameMap);
            }
            lastNameMap.AddFamily(family);

        }

        public void RemoveFamily(Family family)
        {
            List<Family> families;
            if (_nationalitiesToFamilies.TryGetValue(family.Nationality, out families))
            {
                families.Remove(family);
            }

            LastNameFamilyMap lastNameMap;
            if (_familyMappings.TryGetValue(family.Nationality, out lastNameMap))
            {
                lastNameMap.RemoveFamily(family);
            }
        }

        public IEnumerable<Family> GetFamilies(string nationality)
        {
            List<Family> families;
            if(_nationalitiesToFamilies.TryGetValue(nationality, out families)){
                return families;
            }
            return new List<Family>();
        }

        public Family GetFamily(string nationality, string lastName)
        {
             LastNameFamilyMap lastNameMap;
            if (_familyMappings.TryGetValue(nationality, out lastNameMap))
            {
                return lastNameMap.GetFamily(lastName);
            }
            return null;
        }

        public Family GetFamily(Person person)
        {
            return GetFamily(person.Nationality, person.LastName);
        }
    }

    private NationalityFamilyMap _nationalityFamilyMap;


    private List<Family> _families;
    private List<Person> _people;


    public Families()
    {
        _nationalityFamilyMap = new NationalityFamilyMap();
        _families = new List<Family>();
        _people = new List<Person>();
    }

    public IEnumerable<Family> Families { get { return _families; } }
    public IEnumerable<Person> People { get { return _people; } }


    public void Add(Person person)
    {
        Family family = _nationalityFamilyMap.GetFamily(person);
        if (family == null)
        {
            family = new Family(person);
            _nationalityFamilyMap.AddFamily(family);
            _families.Add(family);
        }
        else
        {
            family.People.Add(person);
        }
        _people.Add(person);
    }

}

This may seem over whelming at first.  I also apologize for the dearth of comments.  This is just one way you can implement the management of the relationship we had in only collections at the end of Part 3.  There is a lot more complexity in the code here than the basic collection structure we had, but it is all for good reason.  What we are doing here is encapsulating the management into the object itself.  We are using two private classes to abstract away the underlying collections.  
The private classes may seem complex, but they are in reality quite simple.  Their purpose it maintain specific relationships between objects and enable us to easily retrieve those relationships. Lets look at the LastNameFamilyMap alone.
    /// <summary>
    /// Private class to encapsulate the logic connecting a last
    /// name to a family
    /// </summary>
    private class LastNameFamilyMap{
        private Dictionary<string, Family> _familyMap;

        public LastNameFamilyMap()
        {
            _familyMap = new Dictionary<string, Family>();
        }

        public void AddFamily(Family family)
        {
            _familyMap.Add(family.LastName, family);

        }

        public void RemoveFamily(Family family)
        {
            _familyMap.Remove(family.LastName);
        }

        public Family GetFamily(string lastName)
        {
            Family family;
            if (_familyMap.TryGetValue(lastName, out family))
            {
                return family;
            }
            return null;
        }
    }

We have specific methods to add, remove and add families.  This gives us explicit control over changes to our structure.  This is something we don’t get with straight collections.  We’ve encapsulated the underlying dictionary into a structure so that if we want to keep track of additional information from our Family object we are able to easily.  Our management points for changing this are only in  the AddFamily and RemoveFamily methods of the class.  This means we can easily provide a reverse lookup Dictionary, which would be useful if LastName wasn’t part of the Family object (which in the real world might be the case).  We also are sure that nobody can change our underlying collection without accessing it through our methods. 
This is basic data encapsulation but it’s something that is not often done.  I was working on a relationship like this just the other day.  I had to allow the storage of a new type of relationship.  If I had just had a HashMap of a HashMap of a HashMap I would have never been able to enforce the tracking constraints that I did.  Let say I wanted to make it so only families with a last name that began with A were being tracked.  This is an easy check to put into our Add method.  Or I can make it so families with more than three members cannot be removed.  The data encapsulation puts our logic into this one location and ensures that the rules are always followed.
 Looking at our private classes we could have substituted the LastNameFamilyMap  reference in NationalityFamilyMap with just a straight Dictionary but then we’d be back to where we started with a Dictionary containing a Dictionary.  In essence, this is what that relationship is, but we are using another object to indirect this relationship.  All the complexity of dealing with the dictionary moves inside the object making additions and removals from it much simpler.
The same is the case for our NationalityFamilyMap.  Here we actually have two Dictionaries.  One allows us to easily keep track of Families based upon nationality while the other provides a quick look-up for a Family based upon Nationality and LastName of a Person. 
Because of the sub-classes manages the adding a person to our collection is quite simple.  We can easily look-up if that person’s family exists through our sub-class.  If it does not exist, we can easily create a new one and add it to our internal collections and objects.  The onus for the complicated logic needed to create add a new Person, including creating a new Family is on us as a developer of the management class, not the user of the data structure.  Not included in the code are the accessors that the definition of this structure might require.  We might want to get all the families of a nationality.  This would involve simply adding a method to our Families class that looks like this:

public IEnumerable<Family> GetFamilesOfNationality(string nationality)
{
    return _nationalityFamilyMap.GetFamilies(nationality);
}

This is much simpler than what we would have had to write if we still had our collection only implementation from Part 3.  The function that would then have looked like:
 public IEnumerable<Dictionary<string, List<string>>> GetFamilesOfNationality(string nationality)
{
    Dictionary<string, Dictionary<string, List<string>>> lastNameMap;


    if (NameMap.TryGetValue(nationality, out lastNameMap))
    {
        return lastNameMap.Values;
    }
    return  new List<Dictionary<string, List<string>>>();
}

Underneath the hood, the code would look very similar but as far as readability goes, the version with the Family class is much easier to read.  The layering of objects also allows us to keep congruent sets of information to speed up retrieval when we want something other than a direct lookup.  We keep a running list of families at very little cost.  If we wanted to keep them sorted in some manner (add order or on some value) this would be easy as well.
Part of what makes this method of encapsulating values valuable is controlling access.  We limit the access points through our add and remove methods.  We therefore are able to add new backing collections that aid in our access easily.  Since we have limited access points, object management only needs to be added in those specific points.  The added complication in our object ultimately makes for a more elegant code that uses our management object.  It is also more maintainable as the encapsulation pinpoints responsibilities within objects.
We could keep examining the structure in detail and analyze all of its facets, but I think this is sufficient for now.  I would encourage you to try writing accessors to get values out of each structure and see which one you like using better.  When writing accessors for the class structure feel free to add new collections to improve retrieval time and see how the code easily falls into place. 
What I hope you have gotten out of the last series of posts is an understanding of the HashMap of HashMap of HashMap problem.  I hope you see that in code it is an unwieldy structure to use.  It is a way of circumventing object-oriented design.  I hope I have given you at least one path on how you can transform that structure into classes.  My hope is that in the future when you see nested collections you will question if that is really how you want the data to be contained.  It is more work up front to create the classes to contain the data but in the end, you will save yourself headaches if you do write specialized classes to manage complex relationships.  

Wednesday, December 1, 2010

HashMap of a HashMap of a HashMap Problem -- Part 3 - Seeing the Problem

Previously we looked at the need for meta-data and good naming when using collections to form associations.  We also looked at how collections are used to circumvent typing.  Now we are really going to see how this circumvention can get you into trouble and what my lead developer was actually talking about in his argument with the other developer.
Up to this point we’ve created properties for our mappings without a class to contain them, this has been very purposeful.  Let’s ignore that we added the name to birthday mapping and assume we still just have name to nicknames that we want to map.  Let’s say these names are only first names.  Let’s say we want to map last names to a group of first names that match to a list of nicknames.  That’s probably confusing to read as we have now chained three things together and that makes it hard to follow.  A property for this sort of value structure, using only collections, would look like this:
public Dictionary<string, Dictionary<string, List<string>>> NameMap { get; }

I’ve gone back to the simple naming convention and lack of documentation on purpose to illustrate how I’ve seen similar relationships in code.  If you were given only this property with no documentation and a name like NameMap what would you make of it?  Would you have any idea where to begin?  Quite simply, no.  You’d have to start digging through the code to determine what all the values are for.  Let’s be “kind” and add documentation and improved naming.

/// <summary>
/// Gets a map of last names to a map of first names to nicknames for
/// the person with that first name. The dictionary maps
/// last names -> first names -> nicknames
/// </summary>
public Dictionary<string, Dictionary<string, List<string>>> LastNameToFirstNameToNicknameMap { get; }


If you can understand what that comment intends to communicate then congratulations.  I’m going to be honest and tell you I spent five minutes trying to make a coherent sentence to describe the structure and believe I have failed miserably.  The only way I could even remotely express what was being contained in the structure was to use the arrow notation at the end of the comment.  That only gives me the very basic information.  The fact that I was having trouble explaining the data structure should set off red flags that it is not a good data structure.  Also, notice the ridiculous name I had to give the property in order to clarify its meaning.  There are eight separate words in the name.  It explains the object but I tend to get lost in the middle or reading it.  Its way too long, even though it spells out our relationship.  Again, this is another red flag.
We are now beginning to see the HashMap of a HashMap of a HashMap problem.  We only have two levels of nesting with our HashMaps and we already are beginning to see that it is becoming unmanageable. 
Let’s take it one step further and see how this gets even more complex.  Say we want to add one more level of lookups to our data structure and map nationality to our previous structure.  For simplicities sake I’m just going to consider everyone with the same last name shares a nationality and I am going to use a string for nationality.  Obviously, this is not how the real world works and we’d probably create an enumeration for nationality.  This example is contrived, but we’ll live with it because I’m attempting to illustrate a problem that occurs when you have nested data relationships.  This is the important factor for our contrived example.  If we want a property that returns our new relationships we would then have:
public Dictionary<string, Dictionary<string, Dictionary<string, List<string>>>> NameMap { get; }

We now have a HashMap of a HashMap of a HashMap with a contained list thrown in for good measure.  Good luck ever understanding what is going on here.  The scary thing is I’ve come across such a data structure more than once in my career.  It scares me that someone could look at a structure like this and not think that there was something wrong.  I’m not going to even attempt to add the meta-data and transform the name.  I encourage you to try to do so in a cogent manner.  It is near impossible to write an explanation of what is being mapped here.  Part of our lesson here is if you cannot explain the basic structure of a property in a sentence, it is probably too complex.
The structure also leads to complex access code with nested if-blocks just to get a value.  Let’s assume we want to see if a nickname belongs to someone with a given nationality, last name, and first name.  The code to do this would look something like:
public bool HasNickname(string nationality, string lastName, string firstName, string nickname)
{
    Dictionary<string, Dictionary<string, List<string>>> lastNameMap;
    Dictionary<string, List<string>> firstNameMap;
    List<string> nicknames;

    if (NameMap.TryGetValue(nationality, out lastNameMap))
    {
        if (lastNameMap.TryGetValue(lastName, out firstNameMap))
        {
            if (firstNameMap.TryGetValue(firstName, out nicknames))
            {
                return nicknames.Contains(nickname);
            }
        }
    }
    return false;
}

There are 3 nested ifs and three objects being created as outs that are collections themselves.  The long and short of this being that even a simple boolean check becomes insanely complex.  I even tried using LINQ to write a simplified version of this and I got myself lost in it.  Point being is the end user has to write complex data accesses to use our structure. 
The overarching point I’m trying to make is that if you have more than one level of nesting of collections you should re-evaluate how you are writing your code.  You are relying on weak references to build your structure.  Your code will be very hard to maintain.  It is hard enough to try to explain what the structure is encapsulating.  It is even harder to understand it without a priori knowledge.  My general rule of thumb is that anything past our original nesting of a list in HashMap (Dictionary) is too much and should be rewritten.  You are putting a large, unneeded burden on code that uses your data structure. 
Next time we will see what we should do instead  to fix/avoid the problem.

Sunday, November 28, 2010

HashMap of a HashMap of a HashMap Problem -- Part 2 - Adding Complexity

In our previous part we started discussing how even basic collection usages require careful use of meta-data.  Our example was simple but now we are going to begin to add some complexity to it.  This complexity will begin to lead us down the path to the problem.
 Let us make some modifications to our example to add a little complexity.  In all likely-hood, a person will have more than one nickname.  We want to be able to retrieve all their nicknames based upon their name.  Therefore, we modify our property to be as follows:
/// <summary>
/// Gets a map of a user's name (the key) to their nicknames
/// (the value) so we can look up their nicknames quickly
/// </summary>
public Dictionary<string, List<string>> NameToNicknamesMap { get; }

This still is not horrible but it is more complex.  We now have a nested collection that causes us some grief in writing code to access the values.  This is not a big deal it just makes our code less clean.  We still can easily understand it if the property is well documented.
 Let’s step back for a second and make an observation of this structure.  Why would this structure exist?  It exists because we want to make some sort of connection between name and nicknames.  Doesn’t that sound an awful lot like what we should use a class for?  Let’s write a class that provides the same connection:
public class Person
{
    public string Name { get; }
    public List<string> Nicknames { get; }
}

We then would also change our property to:
public List<Person> People { get; }

There is one thing that we are missing in this implementation.  That is we’ve lost the ability to quickly lookup the nicknames based upon the name.  We will ignore this for right now and revisit it later on once we develop our structure further.
Our class has created as strong relationship between our name and nickname.  The relationship is explicit and there is no reliance on meta-data to understand the relationship.  Using the old structure if we wanted to also map name to another value, like birthday, we’d have to create another property like this:
/// <summary>
/// Gets a map of a user's name (the key) to their birthday
/// (the value) so we can look up their birthday quickly
/// </summary>
public Dictionary<string, DateTime> NameToBirthdayMap { get; }

In creating this property, we have to ensure that we provide the same detail in the meta-data and naming to keep straight what we are mapping.  If we want to create the same relationship using our class we just add a new property for the birthday and we have created the same relationship.  Our Person class now looks like this:
public class Person
{
    public string Name { get; }
    public List<string> Nicknames { get; }
    public DateTime Birthday { get; }
}

We now have strong links between the name, nicknames, and birthday.  This is simple object-oriented encapsulation and nothing revolutionary.  We also are using an example that leads us naturally to think of an object.  A person is an object type that is naturally created when we want to represent people in a system.  We know we want to encapsulate data into a person object early in design because we know that a person has many characteristics we want to put together.  We are using a person only as a concrete example.  The type of situation we are discussing here is one where originally, there is a single relationship and over time, similar relationships are added.  If our original relationship is done with a map then our future relationships will most likely be added with a map as well.  This is natural because programmers tend to follow the pattern of existing code.  Unless someone takes a step back at some point you easily could wind up with 5, 10, 20 maps mapping name to some other values.  The class containing these mappings feeds on itself because as the pattern becomes more pervasive there is less incentive to change.  This probably won’t happen with person data which can clearly be thought of in object terms.  It can easily happen with abstract concepts within a system that are hard to translate into a real world object because people probably aren’t thinking of the relationships between values as object defining relationships. 
Let’s take a step back and circle back to my previous point about circumventing the object orientedness of the language.  When you use collections to make connections between name and nicknames or name and birthday you are creating a weak relationship.  The relationship only exists as far as that Dictionary or HashMap exists.  You haven’t broken the typing system and syntactically you are still correct, but you aren’t following good object oriented design practices.  When you create the class that encapsulates the data you create strong relationships that are self evident to someone reading and using your code. 
Again, this is not the full problem, but another stepping-stone onto seeing the problem and the remedy for it.  It still can be manageable to handle data with only the collections if there is a small number of relationships.  It is also easy to refactor if we begin to recognize we should encapsulate these relationships together.  Next time we’ll start to see when writing code in this manner quickly becomes a problem as we add more levels.