Dan Newcome, blog

I'm bringing cyber back


Data persistence, normalization and cardinality


I’m not a big fan of object-relational mappers and tricky data access schemes. I am, however, a fan of good data normalization in relational database systems. I’m somewhat new to key-value and document stores, but I have done projects using things like Amazon’s new DynamoDB.

I’m doing some work on a side project that involves counting user votes. I decided to use MySQL as the persistence mechanism for this project simply because it is the de facto open source relational database. I’m not trying to innovate here with the database, so it makes sense to use something really familiar.

However, I’m realizing that even with a simple project like News.js, there are major tradeoffs to make at the database layer, tradeoffs that affect the design of the database schema.

Consider a typical Digg or Reddit news site where users can upvote or downvote an article. The votes could be considered a kind of metadata, and the article itself could be treated as a sort of document. There are essentially two ways to store this metadata – inline as a field on the document, or as a second “vote” entity that points back to the document.

Consider the following:

create table document (
	id integer not null auto_increment primary key,
	title varchar(1000), 
	body longtext,
	votes integer
);

This is a bare-minimum schema for a document that can track user votes. Every time a user votes on a story we just increment the vote count. However, there are some problems with this schema from both a functionality and a performance standpoint. On the functionality side, we don’t know who voted, so we can’t prevent someone from voting more than once. On the performance side, if a lot of people try to vote at once, there is a lot of write contention on the document table.
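
Just to make the problem concrete, here is roughly what casting a vote looks like against this schema (the document id is only a placeholder):

-- recording a vote is a single in-place update on the document row;
-- every concurrent voter contends for the same row, and nothing here
-- records who actually voted
update document set votes = votes + 1 where id = 42;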

Let’s come up with a simple vote abstraction in the form of a linked database table. Here is a first attempt:

create table vote (
	id integer not null auto_increment primary key,
	fk_user_id int not null,
	fk_document_id int not null,

	foreign key (fk_user_id) references user(id),
	foreign key (fk_document_id) references document(id)
);

Now we know who voted, because we have a field that tracks the user id of the voter. However, we still need to maintain the vote count on the document table unless we do a join to aggregate the individual votes. As for keeping a user from voting more than once, a unique constraint across the user and document columns can express that cardinality, but the running vote count itself is still something the schema can’t articulate, and we had to decide how to structure the data ourselves.
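
A minimal sketch of that constraint, assuming the vote table above (the key name is arbitrary):

-- one row per (user, document) pair; a second vote by the same user
-- on the same document fails with a duplicate-key error
alter table vote add unique key uq_vote_user_document (fk_user_id, fk_document_id);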

Decoupling data from schema

Databases have some physical storage format on disk or in memory. Indexes help us retrieve that data quickly when we issue queries against it. However, indexes are a bit of a grey area in relational databases: they are not strictly part of the schema, although in many cases they are necessary for the data to be accessed at all in a deployment with a lot of data. Often the size of a table is a consideration in the overall schema design.

Should sorting be the responsibility of the database? Sorts are one of the most expensive operations in a database, often requiring whole tables to be scanned or their indexes pulled into the database’s memory footprint. To save on sorting I have sometimes created index tables that are pre-sorted on disk to support different sort orders. Using a clustered index, we get immediate access to sorted data since the data is pre-sorted on disk. However, this is a huge denormalization step in the database.
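
As a rough sketch of what I mean, assuming InnoDB, where the primary key is the clustered index (the table name is hypothetical, and the application or a trigger would have to keep it in sync with the document table):

-- rows are stored in primary-key order, so scanning this table
-- yields document ids already ordered by vote count
create table document_by_votes (
	votes integer not null,
	fk_document_id integer not null,
	primary key (votes, fk_document_id),
	foreign key (fk_document_id) references document(id)
) engine=InnoDB;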

What if we could have a better front-end into our data? Some way to scale out horizontally and create usage-optimized windows into the data? Can we extend the concept of database indexes to “index servers” that are backed by the core persistence engine?

Semantic schemas

Traditional database schemas define key-based relationships between data. These links are used in queries mostly via SQL joins, although they also play a big part in performance. Already there are some relationships that are tedious to define in a database. A many-to-many relationship requires a third table that stores nothing but the mapping between the two entity tables. In a functional sense, a map is a rather elegant way of defining a relationship. On the other hand, having another table clogging up the database is a drag. Perhaps these are orthogonal concerns. What if the database were defined in a more functional way from the ground up? Maps and sets are very common ideas in the functional programming world, and essentially what we have in a relational database is sets of tuples that map to one another.
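
For example, a hypothetical tagging feature on the document table above would need a mapping table holding nothing but the two keys (the tag table here is assumed, not part of the earlier schema):

-- the mapping table is pure relationship: two foreign keys, with the
-- composite primary key acting as the "map"
create table document_tag (
	fk_document_id integer not null,
	fk_tag_id integer not null,
	primary key (fk_document_id, fk_tag_id),
	foreign key (fk_document_id) references document(id),
	foreign key (fk_tag_id) references tag(id)
);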

Map reduce vs. count()

I always cringe a little bit when I see a database procedure that leans heavily on aggregate functions like max and on group by. SQL is a very powerful set language, but this power comes at great computational expense: sorting and counting in a traditional database are expensive. In a distributed database, however, counting turns out to be relatively cheap, since it is easily parallelizable. In a typical RDBMS we don’t have this ability to parallelize the workload, and we are stuck grouping and counting on a single machine.
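
A sketch of the kind of query I mean, against the vote table above, with the counting and sorting done entirely inside the RDBMS:

-- the database has to group and count every vote row and then sort
-- the aggregated result, all on one machine
select fk_document_id, count(*) as vote_count
from vote
group by fk_document_id
order by vote_count desc;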

It seems to me that we shouldn’t have to make decisions about our data based on the retrieval mechanism. That is the whole idea of SQL: tell the datastore “what” we want and let it determine “how”. In practice, however, we have to think about the “how” up front when we design our schema!


Written by newcome

July 7, 2012 at 11:15 am

Posted in Uncategorized

On dependency management


Like many (all?) developers, I have a list of thoughts and gripes about managing and developing libraries while working on a coding project. The experience that prompted me to finally start writing this post was trying to get the Selenium WebDriver Java libraries installed and ready to use on my system.

I think I could have downloaded a giant tarball of everything, but I really wanted to just use Maven or some automatic dependency management tool. Initially I was using Clojure, but I realized that the Clojure wrapper library didn’t fully support some things I needed, so I figured I’d dive down and just do the thing in Java. Why do I mention this? Well, I suppose the Clojure guys have things pretty well under control, as grabbing the dependencies from Clojars using Lein seemed to work just fine. Lein uses Maven under the hood, so I figured I’d be good to go.

Figuring out how a Maven POM file is supposed to go is its own level of pain when all you want to do is have your dependencies and classpath set up for you. I never really figured out how this was supposed to work before moving on to Ivy, a tool that lets you use Maven repositories from Ant build scripts, which I’m more familiar with.

Installing Ivy went pretty well, but when I got it to pull down the dependencies I realized that Selenium relies on a version of Jetty that was hosted by JBoss, who apparently no longer hosts the Jetty project. So now the dependency resolution process is broken. I wasn’t sure how to re-point this dependency since the process is kind of opaque, but I figured I could get Ivy to resolve it locally from my Maven cache if I added the artifact manually. Even once it was added, Ivy still wouldn’t pick it up, since Ivy uses its own private cache. After a little more research I figured out that you need to install the dependency in the Ivy local repository rather than the cache. At this point I was wondering why Ivy needs its own cache separate from the .m2 Maven folder.

Rather than answer this question, I pretty much threw out the idea of using Java at all for this task. It’s just too much hassle.

This experience got me thinking about package managers and dependency resolution. There are some inherent difficulties in getting this stuff right, but I’m wondering whether a fully pedantic approach like Maven’s is really appropriate.

Strict vs. loose

I think at the very lowest level, a dependency management system should be able to pull down a library and its dependencies directly with a single command. Node’s NPM gets this right, as do RubyGems and others. It’s not clear to me whether Maven can do this at all, or whether it needs some kind of special boilerplate POM file in order to do it. The idea is that when working on a project, there comes a time when “hey, we need to add a logging framework now”. We aren’t trying to maintain some kind of solid build or backward compatibility. We don’t need ironclad versioning at the top level. However, I do expect to get a working library with all of its exact dependencies included. In many cases a common library will pull in dozens of other libraries, and if any one of them is not available from the package management system, the top-level dependency request will fail.

I feel like this is too much to ask of any system that relies on third parties publishing code. The lowest common denominator is a tarball release from a vendor that contains a working binary version of the library. The dependency management tool should be able to deal with this case, or any case that would be supported by the language/environment. Also, if I’m relying on a particular version of something, I may want to have my own local repository set up. This is almost always possible, but varies greatly in difficulty between dependency management tools.

Progressive enhancement

On the subject of local repositories, I think that most environments make code modularity much too difficult in practice. Java and C# tend to encourage an up-front proliferation of libraries and namespaces. If pulling code out into libraries were easier, it wouldn’t be necessary to make this up-front decision so quickly.

My feeling is that NuGet is a step in the right direction, although I think they are way off base with some things. The ability to set up a plain old folder as a repository is gold. The bootstrapper is kind of silly: you’d expect the tool not to get so huge that it needs a bootstrapper, and updating to a new version is something the normal tool should handle just fine. I tried to get NuGet working under Mono, and it doesn’t work. It boggles my mind that NuGet needs to use something so cutting-edge that the latest Mono drop can’t run it.

Wish list

Anything can be a package
Repositories can be just folders
Repositories don’t have to be just folders
Projects can be defined inline
Dependencies can be redirected simply

I actually think that git could provide a really awesome back end for a dependency management tool, and could satisfy many of the above wishes very easily.

Hopefully I’ll dig into this a little bit more soon.

Written by newcome

July 2, 2012 at 3:42 pm

Posted in Uncategorized