Sunday, July 5, 2015

Postgres indexes

Recently, I had a situation where I needed to think about how I was using Postgres indexes. I had a simple Book table with the following schema...
>\d book

                  Table "shopping.book"
       Column        |          Type          | Modifiers 
---------------------+------------------------+-----------
 id                  | uuid                   | not null
 version             | bigint                 | not null
 amount_minor_units  | integer                | not null
 currency            | character varying(255) | not null
 author              | character varying(255) | not null
 publisher           | character varying(255) |    
The author and publisher columns were just String pointers to actual Author and Publisher references that lived on another system, meaning that classical foreign keys couldn't be used and that these dudes were just normal columns.
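For anyone recreating the setup, DDL along these lines produces a table like the one above (I'm assuming the id index mentioned later is the primary key):

create table shopping.book (
    id                  uuid primary key,
    version             bigint not null,
    amount_minor_units  integer not null,
    currency            character varying(255) not null,
    author              character varying(255) not null,
    publisher           character varying(255)
);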

I needed to get an idea how the table would perform with lots of data, so first up some simple SQL to put in lots of test data:

 
> CREATE EXTENSION "uuid-ossp";
> insert into book (id, version, amount_minor_units, currency, author, publisher) 
select uuid_generate_v4(), 2, 22, 'USD', 'author' || x.id, 'publisher' || x.id from generate_series(1,1000000) AS x(id); 
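One small thing worth doing after a bulk load like this is to run ANALYZE so the planner's statistics reflect the new data:

analyze book;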
This table was going to be hit lots of times with this simple query:
select * from book where author = 'Tony Biggins' and publisher='Books unlimited';
To get the explain plan, I did:
dublintech=> EXPLAIN (FORMAT JSON) select * from book where author = 'Tony Biggins' and publisher = 'Books unlimited';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 [                                                                                                                                +
   {                                                                                                                              +
     "Plan": {                                                                                                                    +
       "Node Type": "Seq Scan",                                                                                                   +
       "Relation Name": "book",                                                                                                   +
       "Alias": "book",                                                                                                           +
       "Startup Cost": 0.00,                                                                                                      +
       "Total Cost": 123424.88,                                                                                                   +
       "Plan Rows": 1,                                                                                                            +
       "Plan Width": 127,                                                                                                         +
       "Filter": "(((author)::text = 'Tony Biggins'::text) AND ((publisher)::text = 'Books unlimited'::text))"+
     }                                                                                                                            +
   }                                                                                                                              +
 ]
(1 row)
As can be seen, Postgres is doing a Seq Scan, aka a table scan. I wanted to speed things up. There was only one index on the table, which was for the id. That was just a conventional B-Tree index, useless for this query since id wasn't even in the where clause. Some of the options I was thinking about:
  • Create an index on author or publisher
  • Create an index on author and create an index on publisher
  • Create a combination index on both author and publisher (all three candidates are sketched in SQL below).
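In SQL terms, the candidate indexes look like this (publisher_idx is only shown for completeness; it never actually gets created in what follows):

create index author_idx on book(author);
create index publisher_idx on book(publisher);
create index author_publisher_idx on book(author, publisher);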
Hmmm... let the investigations begin. Start by indexing just author.
dublintech=> create index author_idx on book(author);
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 [                                                                            +
   {                                                                          +
     "Plan": {                                                                +
       "Node Type": "Index Scan",                                             +
       "Scan Direction": "Forward",                                           +
       "Index Name": "author_idx",                                  +
       "Relation Name": "book",                                               +
       "Alias": "book",                                                       +
       "Startup Cost": 0.42,                                                  +
       "Total Cost": 8.45,                                                    +
       "Plan Rows": 1,                                                        +
       "Plan Width": 127,                                                     +
       "Index Cond": "((author)::text = 'author3'::text)",+
       "Filter": "((publisher)::text = 'publisher3'::text)"                     +
     }                                                                        +
   }                                                                          +
 ]
(1 row)
As can be seen, Postgres performs an index scan and the total cost is much lower than for the same query using a table scan. What about the multi-column index approach? Surely, since both columns are used in the query, it should be faster again, right?
dublintech=> create index author_publisher_idx on book(author, publisher);
CREATE INDEX
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 [                                                                                                                        +
   {                                                                                                                      +
     "Plan": {                                                                                                            +
       "Node Type": "Index Scan",                                                                                         +
       "Scan Direction": "Forward",                                                                                       +
       "Index Name": "author_publisher_idx",                                                                     +
       "Relation Name": "book",                                                                                           +
       "Alias": "book",                                                                                                   +
       "Startup Cost": 0.42,                                                                                              +
       "Total Cost": 8.45,                                                                                                +
       "Plan Rows": 1,                                                                                                    +
       "Plan Width": 127,                                                                                                 +
       "Index Cond": "(((author)::text = 'author3'::text) AND ((publisher)::text = 'publisher3'::text))"+
     }                                                                                                                    +
   }                                                                                                                      +
 ]
(1 row)
This time Postgres uses the multi-column index, but the query doesn't go any faster. But why? Recall how we populated the table.
insert into book (id, version, amount_minor_units, currency, author, publisher) 
select uuid_generate_v4(), 2, 22, 'USD', 'author' || x.id, 'publisher' || x.id from generate_series(1,1000000) AS x(id); 
There are lots of rows, but every row has a unique author value and a unique publisher value. That means the author index on its own should perform just as well for this query. An analogy: you go into a music shop looking for a new set of loudspeakers that someone has told you to buy, with a particular cost and a particular power output (number of watts). When you enter the shop, you see the speakers are nicely ordered by cost and, what's more, no two sets of loudspeakers have the same cost. Think about it. Are you going to find the speakers any faster if you search by just the cost, or by the cost and the power output?

Now, imagine the case where lots of the loudspeakers have the same cost. Then of course using both the cost and the power will be faster.
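As an aside, if you want to see how much variance Postgres thinks a column has, the planner statistics are exposed in the pg_stats view (an n_distinct of -1 means Postgres reckons every value is distinct):

select attname, n_distinct
from pg_stats
where schemaname = 'shopping' and tablename = 'book'
and attname in ('author', 'publisher');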

Now, let's take this point to the extreme with our test data. Suppose all the authors were the same. The author index becomes useless, and if we don't have the author / publisher combination index we go back to a table scan.

-- drop the combination index and just leave the author index on the table
dublintech=> drop index author_publisher_idx;
DROP INDEX
dublintech=> update book set author='author3';
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 [                                                                                                                    +
   {                                                                                                                  +
     "Plan": {                                                                                                        +
       "Node Type": "Seq Scan",                                                                                       +
       "Relation Name": "book",                                                                                       +
       "Alias": "book",                                                                                               +
       "Startup Cost": 0.00,                                                                                          +
       "Total Cost": 153088.88,                                                                                       +
       "Plan Rows": 1,                                                                                                +
       "Plan Width": 123,                                                                                             +
       "Filter": "(((publisher)::text = 'publisher3'::text) AND ((author)::text = 'author3'::text))"+
     }                                                                                                                +
   }                                                                                                                  +
 ]
(1 row)
So we can conclude that, for combination searches, a single-column index can perform as well as a combination index when there is a huge degree of variance in the data of that single column. When there isn't, it won't perform as well and a combination index should be used. Yes, I have tested by going to extremes, but that is the best way to make the principles clear. And please note: for the case where there is maximum variance in the data, adding another index on the other column in the where clause (publisher) made no difference. This is as expected.
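One caveat on all of the numbers above: the costs in an EXPLAIN plan are the planner's estimates in arbitrary units, not timings. To see actual execution times and row counts, EXPLAIN ANALYZE runs the query for real and reports them:

EXPLAIN ANALYZE select * from book where publisher = 'publisher3' and author='author3';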

Ok, let's stick with the case where there is massive variance in the data values in the column, and add the constraint that the query only ever involves exact matching. In this case, all author values are guaranteed to be unique and you never have any interest in doing anything like less-than or greater-than comparisons. So why not use a hash index instead of a B-Tree index?

dublintech=> create index author_hash on book using hash (author);
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 [                                                                            +
   {                                                                          +
     "Plan": {                                                                +
       "Node Type": "Index Scan",                                             +
       "Scan Direction": "NoMovement",                                        +
       "Index Name": "author_hash",                                 +
       "Relation Name": "book",                                               +
       "Alias": "book",                                                       +
       "Startup Cost": 0.00,                                                  +
       "Total Cost": 8.02,                                                    +
       "Plan Rows": 1,                                                        +
       "Plan Width": 127,                                                     +
       "Index Cond": "((author)::text = 'author3'::text)",+
       "Filter": "((publisher)::text = 'publisher3'::text)"                     +
     }                                                                        +
   }                                                                          +
 ]
(1 row)
Interesting, we have gone faster again. It's not a massive difference this time around, but it's an improvement nonetheless, and one that could become more relevant with more data growth and / or when a more complex query with more computation is required. We can safely conclude from this part that, yeah, if you are only interested in exact matches then the hash index beats the B-Tree index. Until the next time, take care of yourselves.

References:
  • http://www.postgresql.org/docs/9.1/static/using-explain.html
  • http://www.postgresql.org/docs/9.3/static/indexes-bitmap-scans.html

Tuesday, June 23, 2015

Problems with Cobertura and Sonar 5.1

Recently, I was having some bother trying to use Sonar 5.1 with my Grails 2.4.4 project. I was using the usual Groovy stuff: Gmetrics, Codenarc and Cobertura. For the Sonar database I was using Postgres 9.4.

The logfile for the Sonar runner just gave me this:

build 22-Jun-2015 07:44:30 INFO: ------------------------------------------------------------------------
build 22-Jun-2015 07:44:30 INFO: EXECUTION FAILURE
build 22-Jun-2015 07:44:30 INFO: ------------------------------------------------------------------------
build 22-Jun-2015 07:44:30 Total time: 9.153s
build 22-Jun-2015 07:44:30 Final Memory: 30M/1039M
build 22-Jun-2015 07:44:30 INFO: ------------------------------------------------------------------------
error 22-Jun-2015 07:44:30 ERROR: Error during Sonar runner execution
error 22-Jun-2015 07:44:30 ERROR: Unable to execute Sonar
error 22-Jun-2015 07:44:30 ERROR: Caused by: Unable to save file sources
error 22-Jun-2015 07:44:30 ERROR: Caused by: -1
Not much use! I thought there was some permission problem, since "Unable to save file sources" usually means that. But there were no permission issues. I then disabled the Cobertura part of the analysis and things were ok, so something was wrong with the Cobertura part. I then:
  • enabled verbose logging -- sonar.verbose=true
  • enabled full stack trace logging -- using the -e switch
  • enabled full debug logging -- using the -X switch (combined invocation below)
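For reference, running the runner standalone with those options looks something like this (our build invoked it from a CI plan, but the switches are the same):

> sonar-runner -e -X -Dsonar.verbose=true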
Running with these options provided a few more clues.
error 22-Jun-2015 11:09:06 ERROR: Error during Sonar runner execution
build 22-Jun-2015 11:09:06 INFO: ------------------------------------------------------------------------
error 22-Jun-2015 11:09:06 org.sonar.runner.impl.RunnerException: Unable to execute Sonar
error 22-Jun-2015 11:09:06  at org.sonar.runner.impl.BatchLauncher$1.delegateExecution(BatchLauncher.java:91)
error 22-Jun-2015 11:09:06  at org.sonar.runner.impl.BatchLauncher$1.run(BatchLauncher.java:75)
error 22-Jun-2015 11:09:06  at java.security.AccessController.doPrivileged(Native Method)
error 22-Jun-2015 11:09:06  at org.sonar.runner.impl.BatchLauncher.doExecute(BatchLauncher.java:69)
error 22-Jun-2015 11:09:06  at org.sonar.runner.impl.BatchLauncher.execute(BatchLauncher.java:50)
error 22-Jun-2015 11:09:06  at org.sonar.runner.api.EmbeddedRunner.doExecute(EmbeddedRunner.java:102)
error 22-Jun-2015 11:09:06  at org.sonar.runner.api.Runner.execute(Runner.java:100)
error 22-Jun-2015 11:09:06  at org.sonar.runner.Main.executeTask(Main.java:70)
error 22-Jun-2015 11:09:06  at org.sonar.runner.Main.execute(Main.java:59)
error 22-Jun-2015 11:09:06  at org.sonar.runner.Main.main(Main.java:53)
error 22-Jun-2015 11:09:06 Caused by: java.lang.IllegalStateException: Unable to save file sources
error 22-Jun-2015 11:09:06  at org.sonar.batch.index.SourcePersister.persist(SourcePersister.java:84)
error 22-Jun-2015 11:09:06  at org.sonar.batch.phases.DatabaseModePhaseExecutor.executePersisters(DatabaseModePhaseExecutor.java:165)
error 22-Jun-2015 11:09:06  at org.sonar.batch.phases.DatabaseModePhaseExecutor.execute(DatabaseModePhaseExecutor.java:133)
error 22-Jun-2015 11:09:06  at org.sonar.batch.scan.ModuleScanContainer.doAfterStart(ModuleScanContainer.java:264)
error 22-Jun-2015 11:09:06  at org.sonar.api.platform.ComponentContainer.startComponents(ComponentContainer.java:92)
error 22-Jun-2015 11:09:06  at org.sonar.api.platform.ComponentContainer.execute(ComponentContainer.java:77)
error 22-Jun-2015 11:09:06  at org.sonar.batch.scan.ProjectScanContainer.scan(ProjectScanContainer.java:235)
error 22-Jun-2015 11:09:06  at org.sonar.batch.scan.ProjectScanContainer.scanRecursively(ProjectScanContainer.java:230)
error 22-Jun-2015 11:09:06  at org.sonar.batch.scan.ProjectScanContainer.doAfterStart(ProjectScanContainer.java:220)
error 22-Jun-2015 11:09:06  at org.sonar.api.platform.ComponentContainer.startComponents(ComponentContainer.java:92)
error 22-Jun-2015 11:09:06  at org.sonar.api.platform.ComponentContainer.execute(ComponentContainer.java:77)
error 22-Jun-2015 11:09:06  at org.sonar.batch.scan.ScanTask.scan(ScanTask.java:57)
error 22-Jun-2015 11:09:06  at org.sonar.batch.scan.ScanTask.execute(ScanTask.java:45)
error 22-Jun-2015 11:09:06  at org.sonar.batch.bootstrap.TaskContainer.doAfterStart(TaskContainer.java:135)
error 22-Jun-2015 11:09:06  at org.sonar.api.platform.ComponentContainer.startComponents(ComponentContainer.java:92)
error 22-Jun-2015 11:09:06  at org.sonar.api.platform.ComponentContainer.execute(ComponentContainer.java:77)
error 22-Jun-2015 11:09:06  at org.sonar.batch.bootstrap.GlobalContainer.executeTask(GlobalContainer.java:158)
error 22-Jun-2015 11:09:06  at org.sonar.batch.bootstrapper.Batch.executeTask(Batch.java:95)
error 22-Jun-2015 11:09:06  at org.sonar.batch.bootstrapper.Batch.execute(Batch.java:67)
error 22-Jun-2015 11:09:06  at org.sonar.runner.batch.IsolatedLauncher.execute(IsolatedLauncher.java:48)
error 22-Jun-2015 11:09:06  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
error 22-Jun-2015 11:09:06  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
error 22-Jun-2015 11:09:06  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
error 22-Jun-2015 11:09:06  at java.lang.reflect.Method.invoke(Method.java:606)
error 22-Jun-2015 11:09:06  at org.sonar.runner.impl.BatchLauncher$1.delegateExecution(BatchLauncher.java:87)
error 22-Jun-2015 11:09:06  ... 9 more
error 22-Jun-2015 11:09:06 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
error 22-Jun-2015 11:09:06  at java.util.ArrayList.elementData(ArrayList.java:371)
error 22-Jun-2015 11:09:06  at java.util.ArrayList.get(ArrayList.java:384)
error 22-Jun-2015 11:09:06  at com.google.protobuf.RepeatedFieldBuilder.getBuilder(RepeatedFieldBuilder.java:245)
error 22-Jun-2015 11:09:06  at org.sonar.server.source.db.FileSourceDb$Data$Builder.getLinesBuilder(FileSourceDb.java:2911)
error 22-Jun-2015 11:09:06  at org.sonar.batch.index.SourceDataFactory.
Now, I could see earlier in the log that the Cobertura analysis had finished. I could also see that the Cobertura coverage.xml had been generated ok (this is the file which collates the code coverage info). The next step after creating the coverage.xml file is for the Sonar runner to parse it and send the results to Postgres. Connecting to Postgres was definitely not an issue (remember, everything was fine when Cobertura was disabled), so something had to be going wrong at the parsing stage, which meant there must be something odd in the coverage.xml file that the Sonar runner failed to parse. As stated, the coverage.xml file details which line numbers in each class have and haven't been covered.
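A representative snippet (the package, class and file names here are made up) looks like this:

<packages>
    <package name="com.dublintech.me" line-rate="0.8" branch-rate="0.5" complexity="0.0">
        <classes>
            <class name="com.dublintech.me.SomeService" filename="com/dublintech/me/SomeService.groovy" line-rate="0.8" branch-rate="0.5" complexity="0.0">
                <methods>
                    ...
                </methods>
                <lines>
                    <line number="16" hits="1" branch="false"/>
                    <line number="17" hits="0" branch="false"/>
                    ...
                </lines>
            </class>
        </classes>
    </package>
    ...
</packages>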
So what kind of thing could make the parsing barf? What if there was some odd line number in the coverage.xml file? Hmmm... to check this, I ran the following grep:
> grep "line number" coverage.xml
That matched way too much. What about any negative line numbers?
>grep "line number=\"\-" coverage.xml
Nope, none. Ok, back to the exception. Look at this line:
java.lang.ArrayIndexOutOfBoundsException: -1
Hmmm... if a line number was 0, could that make some array indexing in the Sonar runner throw an index out of bounds?
>grep "line number=\"0" coverage.xml
Hit! Time to grep lines before and after and get more info about this file.
>grep -C20 "line number=\"0" coverage.xml
This gave me the culprit. It made no sense to me why Cobertura was saying that line number 0 had 0 hits. It was still possible to open the Cobertura HTML report and view the analysis; Sonar was just barfing when parsing it. So I removed this file from the Cobertura analysis by adding the following to my build config:
coverage {
    xml = true
    exclusions = [
        "**/com/dublintech/me/MyOddFile*"
    ]
}
I then re-ran and hey presto, everything was working. The offending file was no longer in the coverage.xml, which meant the Sonar runner could parse it and everything was ok.

I like Sonar, I like a stable build and I like rapid feedback, so yeah, I was a happy person when it was working again!


Saturday, May 30, 2015

Using separate Postgres Schemas for the same database in a Grails App

Recently, I wanted to use the same Postgres database but split my persistence layer into separate components, each using its own schema. The motivation was to promote modular design, separate concerns and stop developers tripping over each other. Vertical domain models can be difficult to achieve, but not impossible.

In my shopping application, I had a user component, a shopping component and a product component. Now, this is pretty easy if you are using separate databases, but sometimes it's nice to get the separation of concerns using separate schemas in the same database, since staying on one database can make things like DR, log shipping, replication etc. easier.

While I could find documentation for separate databases, I found it difficult to find Grails documentation covering my specific problem: how to use separate schemas in the same Postgres database. So here is how I ended up doing it.

Here is my DataSource.groovy:
String db_url = "jdbc:postgresql://localhost:5432/testdb"
String usernameVar = "db_user"
String passwordVar = "db_secret"
String dbCreateVar = "update"
String dialectVar = "net.kaleidos.hibernate.PostgresqlExtensionsDialect"

dataSource_user {
    pooled = true
    jmxExport = true
    dialect = dialectVar
    driverClassName = "org.postgresql.Driver"
    username = usernameVar
    password = passwordVar
    url = db_url
    dbCreate = "validate"
}

hibernate_user {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
    cache.region.factory_class = 'net.sf.ehcache.hibernate.EhCacheRegionFactory' // Hibernate 3
    singleSession = true // configure OSIV singleSession mode
    default_schema = "user"
}

dataSource_shopping {
    pooled = true
    jmxExport = true
    dialect = dialectVar
    driverClassName = "org.postgresql.Driver"
    username = usernameVar
    password = passwordVar
    url = db_url
    dbCreate = "validate"
}

hibernate_shopping {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
    cache.region.factory_class = 'net.sf.ehcache.hibernate.EhCacheRegionFactory' // Hibernate 3
    singleSession = true // configure OSIV singleSession mode
    default_schema = "shopping"
}

dataSource_product {
    pooled = true
    jmxExport = true
    dialect = dialectVar
    driverClassName = "org.postgresql.Driver"
    username = usernameVar
    password = passwordVar
    url = db_url
    dbCreate = "validate"
}

hibernate_product {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
    cache.region.factory_class = 'net.sf.ehcache.hibernate.EhCacheRegionFactory' // Hibernate 3
    singleSession = true // configure OSIV singleSession mode
    default_schema = "product"
}
Note: there are some obvious optimisations in the config above, but keeping it verbose makes it easier to explain.
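One gotcha worth mentioning: because dbCreate is "validate", the schemas (and the tables in them) need to exist before the app starts. Creating the schemas is a one-off along these lines (the role matches the username in the config above):

create schema shopping authorization db_user;
create schema product authorization db_user;
-- user is a reserved word in Postgres, so this schema name needs quoting
create schema "user" authorization db_user;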

I then mapped each GORM object to the appropriate schema.

class Cart {
    // ...
    // ...
    static mapping = {
        datasource 'shopping'
        // ... 
    }
}

class Address {
    // ...
    // ...

    static mapping = {
        datasource 'user'
    }
}

class Stock {
    // ...
    // ...

    static mapping = {
        datasource 'product'
    }
}
I then started my app, said "yippee, this works", had a little tea break and moved onto the next problem. As can be seen, the trick is to use a separate hibernate closure for each dataSource, specify the schema in there, and name the closures using the same naming convention as you would for separate databases, while making the dataSource closures all point to the same database.

Sunday, June 29, 2014

Book Review: REST API Design Rulebook (Mark Masse)

RESTful style architectures are becoming more and more ubiquitous. There are many reasons for this:
  • Web Tiers are full of JavaScript yearning to make nice simple AJAX requests
  • The obvious shortcomings of SOAP style strong contracts 
  • They are a nice alternative to ESBs for integrating heterogeneous architectures
However, like the MVC design pattern or the Agile software methodology, while many projects may claim to be using the RESTful architectural approach, everyone is doing it differently.

In 2010, Martin Fowler - in his excellent blog - discussed the Richardson Maturity Model. This model provided a very good categorisation technique for assessing the degree of RESTfulness based on how many core REST principles were being used. Although that same model gets a reference in Mark Massé's REST API Design Rulebook, Massé's book goes into much more low-level detail about various REST best practices.

For example:
  • Negative caching: adding cache headers not just to positive responses but also to 3xx and 4xx responses. This is something that may not be obvious, but it could be a good performance / scalability boost depending on the nature of your application, user usage patterns etc.
  • How to version your URIs, your representational forms and your resource objects
  • Using a consistent form to represent link relations
In addition, there is an abundance of other ideas and guidelines, some pretty simple but nonetheless important:
  • Don't end URIs with a slash
  • Don't use underscores in URI paths
  • Put "api" in the domain part of your rest resource path, e.g. 
https://api.dropbox.com/1/oauth/request_token
The main reason why I think it is good to have a book like this is that when a development team is trying to use a REST style architecture, disagreements and misunderstandings will inevitably happen. For example, the proverbial: 'Should a URI end in a plural or singular noun?'
It is always good to be able to reference a respected industry resource to prevent rabbit holes appearing and eating into your valuable project time.

Furthermore, there are some really quick and easy things you can do to make a much a better REST API that are discussed in the book. For example:
  • Adding an ETag HTTP header to a shopping cart resource as items go in and out of it.
  • Using query fields to generate partial responses, and using the ! for an excludes option (example below).
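To make the partial response idea concrete, a request against a hypothetical cart resource might look like this, with the ! dropping a field from the representation:

GET /carts/1234?fields=items,total,!promotions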
Now for some constructive criticism. Firstly, I don't think there will ever be complete consistency in REST approaches. Some of the so-called best practices could be argued to be subjective or nice-to-haves. They are not things that are going to make a big difference to either the functional or non-functional aspects of your architecture. Some of the industry leaders not only take different approaches in their REST APIs, they are also sometimes doing the opposite of what Massé suggests. For example, Massé suggests not including a file extension type in your REST URLs (see Chapter 2), but (at the time of writing) Twitter do just that (the URI is: https://api.twitter.com/1.1/statuses/mentions_timeline.json).

Furthermore, in a lot of projects you will be writing a REST API purely to serve JSON back to a web client. This is quite different to a public facing API on the scale of Twitter, Amazon etc. Therefore you need to ask yourself if you really need to put the time investment in to adhere to every single REST best practice, or just do what makes sense for your project. Some practices obviously make sense; some could be argued to be overkill. For example, making sure you are sending back the correct HTTP code when you know the client never even checks whether it got a 403 or a 405.

I do think this book is written more as if you were designing a public facing API.  If you find yourself in that situation you should definitely, definitely, definitely be considering everything Massé is saying in the book.  But note, the emphasis is on considering which doesn't always mean adhering or adopting.

The book does a very good job of covering best practices that a lot of developers just wouldn't think of (setting response headers such as Content-Length, setting the Location header in responses for newly created resources, how to support HTTP 1.0 when you are trying to discourage caching) and is definitely worth a read, but you really have to think about what makes sense for your project. As stated, some of the suggestions are quick wins; for others you have to assess the cost and the benefit. Following the core REST principles (statelessness, a good resource model, a uniform interface etc.) is what is really important; after that, the challenge is figuring out what works best for each specific project and how to make the most of your time. That will vary from project to project and will depend on project scope, scale etc. A good architectural decision should always consider the various options but make the appropriate decision for the specific project.

Until the next time, take care of yourselves.