EJAPP Top 10 countdown: #1 - Incorrect database usage

Vincent Partington

I'll keep you in suspense no longer. ;-) It's time for numero uno of the EJAPP Top 10 countdown!

Somewhat unexpectedly for an Enterprise Java Application Performance Problems Top 10, the #1 issue is the incorrect usage of databases.

In a lot of Enterprise Java projects, the database is treated as a simple bit bucket and the DBA as just the guy who executes the SQL statements on the production machines. This disdain for the database and its experts leads to badly performing applications, as these examples show:

  • A database that had a table with two columns: KEY and VALUE. Java objects were stored in here by storing property names in the KEY column and their values in the VALUE column. The motivation for this design was extensibility: no changes to the database were needed when the Java objects changed. Of course, there was a lot of code dealing with converting Java objects and their datatypes (int, boolean, etc.) into VARCHARs, that (a) did not perform, (b) had to be changed for every Java object modification, and (c) made it impossible to create proper indexes.
  • A front-end application that read its data from views. The tables behind those views had indexes, but not on the columns used by the queries for the views!
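
The difference between the key/value design from the first example and a properly modeled table can be sketched as follows (the table and column names are made up for illustration):

```sql
-- The key/value design: every property of every object squeezed into strings.
-- No real data types, no useful constraints, and no way to index one property
-- without indexing all of them.
CREATE TABLE object_properties (
    object_id NUMBER        NOT NULL,
    key       VARCHAR2(100) NOT NULL,
    value     VARCHAR2(4000)
);

-- A properly modeled alternative: real columns with real data types,
-- so the database can validate, constrain, and index them.
CREATE TABLE customers (
    customer_id NUMBER        PRIMARY KEY,
    name        VARCHAR2(100) NOT NULL,
    birth_date  DATE,
    active      NUMBER(1)     DEFAULT 1 NOT NULL
);
```

With the second design, a query like "all customers born before 1980" can use an index on birth_date; with the first, it means a full scan plus a string-to-date conversion on every row.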

Instead of pointing the finger at one another, Java developers and DBAs should work together in all stages of a project to make sure the database and the Java application server deliver the performance needed by the application.

Design

  • Think about what data you actually need to persist, when it needs to be persisted, and who is going to persist it. Also, have a look at the EJAPP Top 10 entry on improper caching for what to do (and not to do!) when dealing with caches.
  • Model your data. Don't resort to the key/value pair design mentioned in the example above.
  • Normalize your database to at least the third normal form. Anything past that is probably not worth the trouble.
  • Choose your primary key wisely. Your choice will be around for a long time.
  • Learn about the data types your database offers and their performance impact (Oracle).
  • Use the database for what it's good at: referential integrity checking, validation, and the manipulation, selection, and sorting of large data sets. Don't do this stuff in your Java code. For example, don't write two queries and then merge the results in Java: use a join. It seems obvious, but merging in Java still happens a lot. Think of the EJB n+1 problem!
  • Finally, consider using stored procedures when manipulating large data sets instead of retrieving the data, manipulating it in Java and then writing it all back.
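
As an example of the join-versus-merging point above, using a hypothetical orders/customers schema:

```sql
-- Don't fetch all orders and then issue one query per order to get its
-- customer (the n+1 problem). Let the database do the join and the sorting:
SELECT o.order_id, o.order_date, c.name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date >= DATE '2007-01-01'
ORDER  BY o.order_date;
```

One round trip, one result set, and the database gets to use its indexes and join algorithms instead of your Java loop.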

Implementation

  • Create indexes where needed. At least on the primary key columns (most databases do this automatically) but also consider creating indexes for foreign key columns to speed up joins. Don't go overboard or the database will spend a lot of time updating the indexes. And you will spend a lot of time administering them.
  • Don't retrieve all columns with every query (SELECT * FROM ...), but just the ones you need. You can also do this in HQL and EJBQL.
  • Know how your database handles transactions and concurrency (Oracle, DB2, MySQL) and use that to determine the transaction attributes and isolation levels you'll be using. See the EJAPP Top 10 entry on incorrectly implemented concurrency for more info.
  • Don't be afraid to use the features offered by the database you're using. Database choices are usually even more rigid than application server choices, so it pays to learn all the tricks your database offers and exploit them to get maximal performance.
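
To illustrate the first two points with the same hypothetical orders/customers schema:

```sql
-- Most databases index the primary key automatically, but not the foreign
-- key; an explicit index on it speeds up the join from orders to customers.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Retrieve only the columns you actually need instead of SELECT *:
SELECT order_id, order_date
FROM   orders
WHERE  customer_id = :customer_id;
```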

Deployment

  • Tune the database for the application, just like the application server needs to be tuned. Check the performance tuning guides for Oracle, DB2, and MySQL. I really liked O'Reilly's Oracle Performance Tuning book, but unfortunately the last edition is from 1997. Still, it is an inspiring read, if only to see how these professionals approach performance problems and their projects in general.
  • Specifically, examine the query execution plan (a.k.a. the explain plan) for the queries your application uses. Check the cost of any full table scans and add indexes where needed. Oracle, DB2, and MySQL all offer tools to do this.
  • Learn how the query optimizer of your database (Oracle, DB2, MySQL) works. For example, Oracle 9i executes queries a LOT faster when the tables in both the FROM and the WHERE clauses are ordered from large to small. Changing the order of the tables in the FROM and WHERE clauses can make the execution time of a query drop from over 60 seconds to less than 1 second.
  • Profile your JDBC calls. The open source tool P6Spy gives you insight into the performance of your JDBC calls. The commercial tool Quest PerformaSure does that and a whole lot more.
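
For example, on Oracle the execution plan can be examined like this (the query itself is just an illustration):

```sql
-- Generate the execution plan for a query...
EXPLAIN PLAN FOR
SELECT o.order_id, c.name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;

-- ...and display it. Watch out for costly full table scans.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

DB2 (EXPLAIN plus db2exfmt) and MySQL (EXPLAIN SELECT ...) offer similar facilities.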

To conclude, Enterprise Java developers will need to understand their database and cooperate with their DBA to get maximum performance out of their application!

Thanks to Barre Dijkstra and Wouter van Reeven for providing valuable input for this blog.

Comments (3)

  1. James Stansell

    April 30, 2007 at 10:49 pm

    The point about "Changing the order of the tables in the FROM and WHERE clauses" for Oracle 9i can also apply to 8i and 10g when the rule-based optimizer (RBO) (RULE mode) is active. In general, the point doesn't apply when the cost-based optimizer (CBO) is active.

    This gets back to 2 of your other points: 1) configure your DB properly; and 2) understand the query plans that are being used.

    There's a LOT of outdated or just plain wrong information available on the internet. I've learned the hard way regarding Oracle. The bottom line is to verify that the advice you follow actually does improve performance for your app. Bonus points for understanding why the tip worked for you.

    -james.

  2. Ed Kusnitz

    May 1, 2007 at 7:56 pm

    Our DBA took exception to this same point about the ordering of tables:
    "Oracle 9i executes queries a LOT faster when the tables in both the FROM and the WHERE clauses are ordered from large to small."
    Well, it depends on what he means by this, but this is pretty much wrong. Let's say that I'm joining parent table A to a couple of its child lookup tables:

    select *
    from A
    inner join B on A.b = B.id
    inner join C on A.c = C.id
    inner join D on A.d = D.id
    In this example, it doesn't matter *at all* what order I list the 3 joins in. Since we're inner joining, the contents of B, C, and D each act as filters on A; we also want to do as little IO as possible to get the correct answer as quickly as possible. Thus, the fastest way to do this operation (imagine you had to do it by hand in Excel, if that helps) is to start with the smallest child table, use its index on "id" (hopefully) to quickly match a row in A to a value in the child table, and then remove everything that doesn't match from our buffered copy of A[1]. That means that when we then compare to the remaining 2 child tables, we've got the smallest possible set of rows to join to the child's index. We repeat the procedure by picking the smaller (i.e., the better filter) of the 2 remaining child tables first, and we do the "loosest" join last.

    That's a simple example of finding an optimum execution plan. In order to find it, the 2 pieces of information I needed were (1) the sizes of the tables involved, and (2) whether or not the join columns were indexed. If you wanted to be more precise while trying to decide which join-filter to apply first, you might also take into consideration things like how easy the index is to use (if it's got a lot of columns you don't need in it, the IO in the index can be significant) and the cardinality (selectiveness) of your join condition in the child table, if that condition isn't unique.

    These are all things the optimizer looks at, guaranteed. Sizes are gathered by statistical sampling; you can see those stats in the system view all_tables, for example. Indexes are obviously in the data dictionary as well, and the optimizer has stats to refer to for those that tell it about cardinality and fragmentation.

    --Can you explain more what you mean?

    AFAIK, the only case where join order matters in something like this is when size and cardinality are the same, which is really an edge case...

  3. Wouter van Reeven

    May 3, 2007 at 9:56 am

    Hi Ed,

    This remark originated from me. In your example, let's say we would like to execute a query like this

    select * from A, B, C, D where A.b = B.id and B.c = C.id and A.d = D.id

    Please note that A has a foreign key to both B and D. I noticed in several cases that the query execution time decreases a lot if size(A) > size(B) > size(D). If e.g. size(A) > size(D) > size(B) the query should be

    select * from A, D, B, C where A.d = D.id and A.b = B.id and B.c = C.id

    I do agree that this is only the case when one table has foreign keys to more than one other table and when one of those other tables contains many more rows than the others.

    Greets, Wouter van Reeven
