DaChesserCat's Journal: Graphical SQL query builder

There has been significant talk, as of late, about how to do programming using a graphical interface instead of the usual text-based ones. A previous entry in this journal talked about that. It's time to start nailing down some more concrete ideas.

First off, I tend to think of anything involving aggregates (and yes, SQL queries are operations across an aggregate of records/objects/tuples/whatever) as map-reduce operations. You create a function which filters/transforms each tuple individually (map), and you may do something across the resulting set to reduce the total number of tuples (reduce). Selecting a subset of fields from a tuple is a map operation. Selecting a subset of tuples from the total set is a reduce operation. As such, you can think of a SQL query as one or more map-reduce operations. For example:

select f1, f2, f3 from t1 where f1 = 'blah' and f2 = 1

We map across tuples from set t1, emitting only fields f1, f2 and f3. We reduce across the records, emitting only those where f1 = 'blah' and f2 = 1. In principle, the output of map -> reduce will be the same as the output of reduce -> map. In practice, we may want to do the reduce(s) first, emitting only the primary key values from t1 where the fields meet the criteria, then map across the fields for tuples with the matching primary keys. The 'where' clause can be broken into two pieces (f1 = 'blah'; f2 = 1), with the output of one feeding the input of the other. As such, we would have a reduce (f2 = 1; comparing numbers is faster than comparing strings) feeding into another reduce (f1 = 'blah'), feeding into a map. The reduce operations can run in parallel (functional partitioning), and the map operation can run in parallel (data partitioning), feeding into an implicit reduce which merges the parallel-generated results into a single output stream.

Many databases already do this behind the scenes when they build a query plan, using internally-tracked statistics to determine which comparisons should be done first, along with a knowledge of which fields are indexed and which aren't, to optimize the performance of the query. And if you run the same query twice with slightly different values, the second query will usually run much faster, because the database can leverage the already-built query plan.
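
To make that concrete, here's a rough sketch of the decomposition in plain Python, with t1 as a hypothetical list of dicts standing in for the table (the names mirror the example query, nothing more):

    # t1 stands in for the table: a list of dicts, one per tuple
    t1 = [
        {'pk': 1, 'f1': 'blah',  'f2': 1, 'f3': 'x'},
        {'pk': 2, 'f1': 'other', 'f2': 1, 'f3': 'y'},
        {'pk': 3, 'f1': 'blah',  'f2': 2, 'f3': 'z'},
    ]

    # reduce #1: the cheap numeric comparison goes first
    step1 = filter(lambda t: t['f2'] == 1, t1)
    # reduce #2: the string comparison runs on the survivors
    step2 = filter(lambda t: t['f1'] == 'blah', step1)
    # map: emit only the requested fields
    result = map(lambda t: {'f1': t['f1'], 'f2': t['f2'], 'f3': t['f3']}, step2)

    print(list(result))   # [{'f1': 'blah', 'f2': 1, 'f3': 'x'}]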

So, how do we represent this, visually?

We can start with the tree-based structure mentioned in my previous journal article. But that locks you into a particular sequence of operations. I'd like something a little more abstract; I'd like the system to be able to optimize the query.

Represent each table/tupleset as a circle. There are lines radiating off this circle, representing fields/attributes in each tuple in the set. You can connect one or more lines from one circle to one or more lines on another such circle, indicating a join. We would need some kind of graphical representation indicating inner/outer join. Fields which are indexed should be easily recognizable. Field(s), singular or composite, which comprise the Primary Key should also be easily recognizable.
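
To give a feel for what the editor would need to track, here's a minimal, purely hypothetical sketch of that visual model in Python; none of these class or field names come from an existing tool:

    from dataclasses import dataclass, field

    @dataclass
    class FieldLine:
        name: str
        indexed: bool = False      # drawn so indexed fields stand out
        primary_key: bool = False  # part of the (possibly composite) PK

    @dataclass
    class TupleSetCircle:
        name: str
        fields: list = field(default_factory=list)

    @dataclass
    class Join:
        kind: str    # 'inner' or 'outer', shown graphically on the connection
        left: tuple  # (circle, field name)
        right: tuple # (circle, field name)

    t1 = TupleSetCircle('t1', [FieldLine('pk', indexed=True, primary_key=True),
                               FieldLine('f1', indexed=True),
                               FieldLine('f2')])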

You wrap another polygon around that, with a line reaching the edge of the polygon. This polygon represents a filter/reduce/where clause. You would put something on the edge of that polygon representing the limiting criterion. Need multiple criteria, such as an 'and' statement? Add another such polygon around that one, pulling a field out to it and specifying criteria for that field. Within a single criterion polygon, multiple criteria on its field combine as an implicit 'or'. Multiple concentric filter polygons can be merged into one, thicker filter polygon, hiding the overall filtering criteria. Or a thick one can be 'exploded' back into multiple single-field-criteria polygons, with one dragged inside or outside another, if you really want to control the order in which they're applied.
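
In map-reduce terms, concentric filter polygons are just composed predicates. A toy sketch in Python, assuming the same example fields as before:

    # Each polygon contributes predicates on one field.
    # Criteria within one polygon OR together; nested polygons AND together.
    def polygon(criteria):
        return lambda t: any(c(t) for c in criteria)

    inner = polygon([lambda t: t['f2'] == 1])       # innermost polygon
    outer = polygon([lambda t: t['f1'] == 'blah'])  # wrapped around it

    # 'Merging' concentric polygons into one thick polygon is just AND-ing
    # their predicates; 'exploding' splits them back apart.
    merged = lambda t: inner(t) and outer(t)

    rows = [{'f1': 'blah', 'f2': 1}, {'f1': 'blah', 'f2': 2}]
    print([t for t in rows if merged(t)])   # [{'f1': 'blah', 'f2': 1}]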

What fields come out of this query? Find the lines representing the values on the initial tupleset circles and pull them through all the polygons, such that they radiate outwards. These can then be connected to other tuplesets, filtered or otherwise.

You can put a filter around the result of a join. But if the filter is only applied to one field from one tupleset within that join, the filter can be 'shrunk' down to surround just that tupleset, with the join happening after the filtering. Realistically, this is what the database will probably do internally, so as to reduce the number of tuples which need to be joined.
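
That 'shrinking' move is the classic predicate-pushdown equivalence. A toy illustration in Python, with hypothetical tuplesets a and b:

    a = [{'id': 1, 'f1': 'blah'}, {'id': 2, 'f1': 'other'}]
    b = [{'a_id': 1, 'g1': 'x'}, {'a_id': 2, 'g1': 'y'}]

    def join(left, right):
        return [{**l, **r} for l in left for r in right if l['id'] == r['a_id']]

    pred = lambda t: t['f1'] == 'blah'   # touches only fields from a

    # Filter around the join...
    v1 = [t for t in join(a, b) if pred(t)]
    # ...is equivalent to filtering a first, then joining fewer tuples.
    v2 = join([t for t in a if pred(t)], b)
    assert v1 == v2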

We would also need another polygon representing modifications to the fields. For example, if you need a case...when statement, that would be another polygon which transforms the values. One or more lines from the tupleset would hit the polygon and one line would come out. An aggregate statement, such as a sum() or count(), would also have a polygon. One polygon can be pulled inside or outside another to indicate the order (inside -> outside) in which the various modifications are applied. Naturally, if an aggregate depends on the output of a modification, the modification will have to remain inside the aggregate. Or they can all be squashed into one thick polygon, hiding the details of what's going on and making it easy to use the output of this within a larger query.
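
Here's a small sketch of a modification polygon feeding an aggregate polygon, with the case...when modeled as a plain Python function; the field names are made up for illustration:

    from collections import Counter

    rows = [{'qty': 2, 'price': 10.0}, {'qty': 1, 'price': 99.0}]

    # Modification polygon: two lines in (qty, price), one line out.
    # This plays the role of a case...when expression.
    def bucket(t):
        return 'big' if t['qty'] * t['price'] > 50 else 'small'

    # Aggregate polygon wrapped around it: a count() per bucket. The
    # modification stays inside the aggregate because the aggregate
    # depends on its output.
    print(Counter(bucket(t) for t in rows))   # Counter({'small': 1, 'big': 1})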

What about insert, update or delete statements? Many of those involve queries to extract the values which will be inserted or updated, or they use queries to determine which tuples get modified or deleted. So a query is still the basic building block of all of those.

And, of course, as with any touch-based application, you can zoom in/out to show/hide the details.

So, what's the deal with each filtering polygon only filtering on one field? And each modifying polygon only producing one output? That's where factoring comes into play. If you end up with multiple queries which all do the same modification or all apply the same filter, it may be possible to pull those 'components' out and create a view. Then your queries can select from that view. It would be relatively straightforward to spot when this is possible, and relatively easy for the editor to let you select the tupleset(s), possibly joined, along with the modifications and the filters, and refactor everything. Anyone who's done significant work with a database knows that querying a view will quite often provide a performance improvement over querying raw tables. Few and far between, however, are the people who can spot these refactoring opportunities by looking at raw SQL code.
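
In this framing, a view is just a shared, named sub-pipeline. A sketch of the factoring, again in Python with made-up names:

    t1 = [{'f1': 'blah', 'f2': 1, 'f3': 'x'}, {'f1': 'nope', 'f2': 2, 'f3': 'y'}]

    # Before factoring: two queries repeat the same filter inline.
    q1 = [t['f3'] for t in t1 if t['f1'] == 'blah']
    q2 = [t['f2'] for t in t1 if t['f1'] == 'blah']

    # After factoring: the shared filter becomes a named 'view',
    # and both queries select from it.
    def v_blah():
        return [t for t in t1 if t['f1'] == 'blah']

    q1 = [t['f3'] for t in v_blah()]
    q2 = [t['f2'] for t in v_blah()]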

Finally, this format isn't limited to SQL. If you put a filter on a field which does a regular expression match, but the database you are targeting can't do a regular expression match, that would indicate that some of the work for this query needs to be done programmatically, outside of the SQL query. And if you are targeting a NoSQL database or an XML-based object store, the querying can be turned into XPath/XQuery and the modifications can be turned into XSL. In this fashion, this could be a powerful tool not just for visualization, but also for code generation.
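
A toy sketch of what that kind of multi-target code generation might look like; the node shape and emitter functions here are my own invention, and a real generator would need proper escaping and type handling:

    # One filter node from the visual graph...
    node = {'field': 'f1', 'op': '=', 'value': 'blah'}

    # ...emitted as SQL for a relational target:
    def to_sql(n, table):
        return f"select * from {table} where {n['field']} {n['op']} '{n['value']}'"

    # ...or as XPath for an XML object store:
    def to_xpath(n, element):
        return f"//{element}[{n['field']} {n['op']} '{n['value']}']"

    print(to_sql(node, 't1'))     # select * from t1 where f1 = 'blah'
    print(to_xpath(node, 't1'))   # //t1[f1 = 'blah']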

Naturally, it should be able to parse existing SQL queries and, if possible, XPath/XQuery/XSL, and create the visual representation. In this fashion, it would be extremely useful for maintaining existing codebases. Couple that with the ability to generate code, possibly in languages different from the input, and you have a powerful tool for migrating from, say, SQL-based relational databases/tables to NoSQL or XML object stores.
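
For the parsing side, an off-the-shelf tokenizer gets you partway there. A sketch using the third-party sqlparse library (pip install sqlparse), applied to the example query from earlier:

    import sqlparse

    sql = "select f1, f2, f3 from t1 where f1 = 'blah' and f2 = 1"
    stmt = sqlparse.parse(sql)[0]

    # Walk the token tree; the WHERE clause arrives as a single grouped
    # token that a visual editor could unpack into filter polygons.
    for token in stmt.tokens:
        if not token.is_whitespace:
            print(type(token).__name__, repr(str(token)))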
