Karlsson on databases and stuff

Thursday, January 11, 2024

MariaDBTop 1.1 released

I have now released version 1.1 of MariaDBTop, it can be downloaded here. This version has several bugfixes, a number of small improvements and features and one rather big feature, which is the ability to show measurements in a "semi-graphic" way. The sample configuration file has also been anhanced and the documentation has some additional stuff in it, like an example section.

The vertical semi-graphics bars looks like this

Happy SQL'ing
/Karlsson

Monday, December 4, 2023

MariaDBTop in action

MariaDBTop is a tool I have come to use quite often but it is a bit hard to configure and set up. In this post I will show you have to set it up by using a realistic example. So, what I want to show is monitoring of a simple application that emulated Telco style CDRs (Call Data Records). A CDR is a very common data structure in Telco and network applications where a CDR is a representation of an event in the network such as a call being started, a message is being sent etc. CDRs are never updated, they are just inserted.

The CDR structure is not standardized but similar across the many places where it is used. In this case I have an application that inserts CDRs over multiple threads. The database that these records are recorded in is replicated and we want to keep an eye on the processing on both the primary and the replica.

Main configuration section

To begin with we have to configure the main section of the MariaDB configuration. To begin with let's use a decent colour screen. For this type of applications I like green text on a black background and for alerts I want a bright red text colour. We are using the default configuration file mariadbtop.cnf, all the configuration data goes into that in this example, and the first part of this file then has the mariadbtop section then looks like this

[mariadbtop]

default-colour=green

default-bgcolour=black

alert-attribute=normal

alert-colour=black

alert-bgcolour=red

We then provide connection information to the MariaDB Primary server

user=monitoruser
password=monitorpwd

With this in place we need to define the replica server that we are connecting to as well as the pages that are to be displayed, the actual attributes of these are defined later. We have one server to add, the replica, and three pages

pages=dashboard,replication,innodbstatus
servers=replica_server

With this in place I also want to refresh the status bit more frequently than usual and write alerts to a file

refresh-time=1000
alert-time=10
alert-file=alert.log

The above configuration will refresh the screen every second (1000 ms) and will write a record to the altert log file when there is an alert and every 10 seconds for ongoing alerts.

The above should be pretty much self explanatory

Defining the dashboard page

Defining the replica server attributes

Before we go on from here we need to define how to connect to the replica server. The name of this is, as defined above, replica_server which means we need a section called just that with information on how to connect to it. In my case it looks like this

[replica_server]
user=monitoruser
password=monitorpwd
host=repl1

The above should be pretty much self explanatory

Defining the dashboard page

Now we get to the interesting bit, the actual data items that we are to monitor and the first is the dashboard page.

[dashboard]
pagetitle=Dashboard
pagekey=d

This above defines the title displayed at the top of the dashboard page and the key used to navigate to it. The next thing we want to do is to display the number of rows in the table we work with, this is called cdr and it also resides in the cdr database and for this we could do a SELECT COUNT(*) but I decide against that and that will slow down the actual processing going on, so instead I run a SHOW TABLE STATUS command which gives a less exact result but doesn't intrude into the normal processing as much. I end up with this setting

sql1=SHOW TABLE STATUS IN cdr LIKE 'cdr'
vertical1
hide1
name-prefix1=primary_ts_

This will run the command above, the vertical setting will cause a row/value pair for each column and the hide attrbute will ensure that this data is fetched but not displayed, we will look at that later on. Finally the name_prefix attribute will prefix each column name fetched with primary_ts_. Following this I need to get the same data from the replica server, like this

sql2=SHOW TABLE STATUS IN cdr LIKE 'cdr'
vertical2
hide2
server2=replica_server
name-prefix2=replica_ts_

The only attribute different here is the server, as we are getting this from the replica. Also, we run the same comman so we need to use a different prefix so we don't userwrite the key / value pairs gotten from the previous command. We have now defined how to get the data we want to display, but we haven't actually displayed it. What we want to show is three things, the number of rows on the Primary and the Replica and the difference between the two. This is how this is done

expr10=this.primary_ts_Rows
name10=Primary_rows
expr11=this.replica_ts_Rows
name11=Replica_rows
expr12=this.primary_ts_Rows - this.replica_ts_Rows
name12=Replica rows behind

The difference may also be calculated like this

expr12=this.Primary_rows - this.Replica_rows
name12=Replica rows behind

The Replication status page

Now it is time to display the replication status, page, and we start by the page specific data

[replication]
pagetitle=replication
pagekey=e

With this in place we are ready to define how to get the data, on the replica we run a SHOW REPLICA STATUS command, and this has to be vertical as we want all the columns from that command, and on the master all we want at this point is the BINLOG GTID. This looks like this

sql1=SHOW REPLICA STATUS
vertical1
server1=replica_server
name-prefix1=replica_
hide1
sql2=SELECT @@GLOBAL.GTID_BINLOG_POS
name2=master_gtid_pos
hide2

Again, we have so far only gotten data that is not displayed, the data that we do want to dipslay now is the GTID pos on the master and the replica and the Seconds_Behind_Master column from the SHOW REPLICA STATUS command. This looks like this

expr10=this.replica_Seconds_Behind_Master
name10=Seconds behind
expr11=this.replica_Gtid_IO_Pos
name11=Replica GTID
expr12=this.master_gtid_pos
name12=Master GTID

And that completes the replication page.

The InnoDB status page

We start the InnoDB status page with the usual page details and then getting InnoDB status using SHOW GLOBAL STATUS.

[innodbstatus]
pagetitle=InnoDB Status
pagekey=i
sql1=SHOW GLOBAL STATUS LIKE 'Innodb%'
value-col1=Value
name-col1=Variable_name
name-prefix1=pgs_
hide1
sql2=SHOW GLOBAL STATUS LIKE 'Innodb%'
value-col2=Value
name-col2=Variable_name
name-prefix2=r1gs_
server2=replica_server
hide2

Now, we want to process the data we got above to something we can display. Let start with the InnoDB cache hit ratio, which is an expression based on two other values returned from SHOW GLOBAL STATUS, so for this we use these two

expr3=round(this.pgs_Innodb_buffer_pool_read_requests / \ (this.pgs_Innodb_buffer_pool_reads + \ this.pgs_Innodb_buffer_pool_read_requests) * 100, 2) \
name3=Primary InnoDB Cache hit ratio
expr4=round(this.r1gs_Innodb_buffer_pool_read_requests / \ (this.r1gs_Innodb_buffer_pool_reads + \ this.r1gs_Innodb_buffer_pool_read_requests) * 100, 2)
name4=Replica InnoDB Cache hit ratio

In addition to this we also like to get some other InnoDB data displayed, like this

expr5=this.pgs_Innodb_rows_read
name5=Primary Rows read
expr6=this.r1gs_Innodb_rows_read
name6=Replica Rows read
expr7=this.pgs_Innodb_rows_inserted
name7=Primary Rows inserted
expr8=this.r1gs_Innodb_rows_inserted
name8=Replica Rows inserted
expr9=this.pgs_Innodb_rows_updated
name9=Primary Rows updated
expr10=this.r1gs_Innodb_rows_updated
name10=ReplicaRows updated
expr11=this.pgs_Innodb_rows_deleted
name11=Primary Rows deleted
expr12=this.r1gs_Innodb_rows_deleted
name12=Replica Rows deleted

And that completes the configuration, we now have three pages of data

The result

With that in place the dashboard looks like this

And the InnoDB page like this

Configuration summary

In total, this is what the configuration looks like

[mariadbtop]

default-colour=green

default-bgcolour=black

alert-attribute=normal

alert-colour=black

alert-bgcolour=red

user=monitoruser
password=monitorpwd

pages=dashboard,replication,innodbstatus
servers=replica_server

refresh-time=1000
alert-time=10
alert-file=alert.log

[replica_server]
user=monitoruser
password=monitorpwd
host=repl1

[dashboard]
pagetitle=Dashboard
pagekey=d

sql1=SHOW TABLE STATUS IN cdr LIKE 'cdr'
vertical1
hide1
name-prefix1=primary_ts_

sql2=SHOW TABLE STATUS IN cdr LIKE 'cdr'

vertical2
hide2
server2=replica_server
name-prefix2=replica_ts_

[replication]
pagetitle=replication
pagekey=e

sql1=SHOW REPLICA STATUS
vertical1
server1=replica_server
name-prefix1=replica_
hide1
sql2=SELECT @@GLOBAL.GTID_BINLOG_POS
name2=master_gtid_pos
hide2

expr10=this.replica_Seconds_Behind_Master
name10=Seconds behind
expr11=this.replica_Gtid_IO_Pos
name11=Replica GTID
expr12=this.master_gtid_pos
name12=Master GTID

With that, i leave you to try MariaDB Top

Happy SQL'ing
/Karlsson

Thursday, November 23, 2023

Disruption, Evolution or Revolution

Disruption or Revolution

So let us assume this for a second: Larry Ellison wakes up one morning in his luxury San Francisco house, for once all alone in the building with the rest of the family in the Hawaii house. He gets out of bed and goes to the kitchen to make some breakfast. He decides on a cappuccino and some oatmeal porridge, he is a bit of a health nut after all. The coffee is prepared for him by his automatic machine that he imported from Switzerland and he micros his porridge and sits down by the bar in the kitchen to eat it. He opens the newspaper and starts reading it, but his mind drifts away, he had been dreaming something just when he woke up and he tries to remember what it was, and then, in the middle of half reading the editorial in the San Francisco Chronicle and half thinking about his recent dream he gets it.

The next step for the huge company that he once founded and now is the CTO of was what he was dreaming about and he now knows where Oracle will head. He feels invigorated in a way that he has not felt for a long time, maybe last time was as far back as 1977 when he decided to co-found a company to build the worlds first commercial SQL-based relational database system.

Larry finishes his breakfast quickly, leaves his house in his bright read Ferrari and drives along the 101, ever so slightly too fast but still so slow that the V12 equipped Ferrari is merely idling. He gets to the huge Oracle Parkway head office and heads up to his secretary, still carrying his expensive, elegant and lightweight leather briefcase. "Get all available execs to my office. My execs that is." His secretary gives him a puzzled look, what is going on, he looks so upbeat, she thinks, but in a very positive way, he seems to be gleaming with enthusiasm as he continues into his office.

Two hours later the available team of executives are gathered in Larrys office, some of them who hasn't been here many times before admire the view of the Bay Area through the glass wall behind Larrys back. Larry looks at them with a smile, something the team is not used to so they know something is about to happen. "I have an idea. Actually it is more than an idea, it is a plan and an order. Exactly how this plan will be implemented is up to you, but take it as an order that it is going to happen. If you don't agree with me, that is fine, there are many other companies out there that you can work for."

Larry now has the attention of everyone in the room, they look at him and wait for the next comment from him. Larry stays quiet for a while, and then he raises his voice ever so slightly. "If you don't want to be part of this big change, the biggest in the history of this company, then you are free to leave the room. There will be no hard feelings, I just want positive vibes, if you feel like being negative you are free to go somewhere else in the Valley. I will not oppose that and I wish you all the best if you don't want this. But make up your mind right now as after I have told you what we are going to do, I expect you to stay around and positively support the journey that we are about to embark on. OK".

The team of C-level executives look at each other, and then again at Larry, who keeps quiet. Then everyone looks at Larry again and all are still quiet. 30 seconds pass in silence, then Larry talks again, real quiet so everyone has to listen real hard. "SQL is about to die, and we are to ones that will kill it". The execs look at each other in amazement and Larry talks again "It stops right here, right now". Everyone's eyes open wide and Larry raises his voice "SQL sucks. SQL will be gone from all Oracle products within a year. No SQL support in Java, Oracle database will be gone and replaced. All SQL will go away and it starts right now."

The story above might be interesting, but it has very little to do with reality, this would never happen, right? If it DID happen, would it count as Disruption? Not really Evolution I guess no, I would really count this as Revolution and those don't happen often unless you happen to live a country in Central America I guess. But they DO happen even in the world of IT, let's roll back the time to the early 1960's, when the Vietnam war ongoing, JFK was still alive and mini-skirts were just around the corner. At the office of the biggest name in computing, IBM, there was a worry, but sales wasn't one of them, IBMs computers sold like hotcakes. Not, the issue was that IBM at the time had some seven different lines, yes seven, of computers, all incompatible in every conceivable way. It has to be understood that in those days, a computer was to a very large extent bespoke, software was mostly an afterthought and was also incompatible not only between machines from different IBM series of computers but also between different individual computers, if one had a different printer than the other, rewrite your code etc. At the same time, IBM computers were selling and the most popular among IBMs many series of computers was the mid-range 1401 series of which thousands were produced of this machine with roots in the 1950's. A key component was that the 1401 was rather complete, it did not only have CPU and memory but other parts were also part of the system such as tape drives, printers and cupholders (just kidding). This sounds obvious today but not so at the start of the swinging sixties. And a key component of the 1400-series was the 1403 line printer. This printer was the fastest in the world and it came with the 1401 computer. So if you wanted the best and fastest printer in the world, you had to but a 1401 computer, there was no other way. And if you had some other computer and you wanted to print on the 1403, then you had to put that on a tape or something that you brought (physically, by hand) to the 1401 to print on the 1403. A mess.

Something called the SPREAD group at IBM began to look into this issue of all these different architectures in 1960ish and they were ready by the end of 1961 when they produced a report. This report said something similar to what Larry said in my made up story above, scrap everything we have and start again. And IBM did just that, to everyones surprise. In 1964 the System/360 series was announced and it came to revolutionize computing. From a technology point of view, if wasn't spectacular, but from the point of view of how the technology was applied, is was spectacular. System/360 was a series to computers, small to large and as such could span the "whole circle" of computing needs, hence the 360-designation. And it came to change and influence IBM to this day and eventually wipe most of its then competitors off the map.

The "Larry Ellison" at IBM at that time was IBM CEO Thomas Watson Jr, the son of the founder of IBM, Thomas Watson. Watson was not someone you could decide not to involve in decisions like this and Watson was actually all for it, deciding to make all existing, and extremely profitable, IBM (and other) computers obsolete to be replaced by one single series of computers. This isn't 100% true though, but sort of. One innovation of the 360 was the large-scale implementation of "micro programming" where the CPU is actually in itself programmable, an idea that originally was the idea of British IT-pioneer Maurice Wilkes (right). The 360 was microprogrammed, and this allowed it be microprogrammed to run Doom. No, just kidding, System/360 was microprogrammed to emulate a 1401. This is one of those things that seemed like a good idea at the time, it allowed programs from the immensely popular 1401 computer to run on a 360, and not only that, the 360 was rather fast for it's time so with this emulation a 1401 program could actually run orders of magnitude faster on a Commodore 64, no wait, I mean System/360, than on an actual physical 1401.

So once the 360 was delivered, which took longer than anyone expected after the announcement as, and this was quite unexpected back then, writing the code for the operating system(s) took waaay longer than expected (see the book "The mythical man month" for details), everything was hanky panky (technical term meaning hanky with a slight touch of panky added to it). One minor issue remained, and stayed to this day, the 1401 emulation. Yepp, this turned out to be an issue in the long run as really old programs, dating back to the 1950's would still run. And run. And run. And this is why software that was developed in the 1950's continues to run to this day (I'm not kidding here) and this is also a reason to why the 360 and it's newer siblings continue to be used, to this day.

In summary, with the System/360 IBM took a bet, and they more or less bet the company. IBM was largely all computers at that time, the privious profitable punch card business was in quick decline. But the bet paid off, big time.

/Karlsson

Back from the dust-filled basement of computer history.

PS: System/360 introduced many new things that we stil use today and take for granted. The "byte" first came around as a term with System/360, which was 32-bit and 8 bits was convenient. That any printer works as well with any computer also started with System/360, although the common technology or printers being the #1 thing to drive sys admins crazy was still established at the time of the System/360 release and is a tradition that continues to this day.

Wednesday, November 22, 2023

Is the cloud dead? Nope.

So, my two cents on if the cloud is dead or not. (If Sun Microsystem was still around I would crack a joke on Sun being good with Clouds. But I don't).

I read that people start moving off the cloud and back on-prem and saving truckloads of money in the process, maybe enough for a gallon or two of gas so you can fill up your Ferrari (for example see https://basecamp.com/cloud-exit and https://shiftmag.dev/leaving-the-cloud-314/). On the other hand, I also see statistics showing that the cloud is growing, for example here: https://www.canalys.com/newsroom/global-cloud-services-q2-2023. What that last article says is that growth is slowing down, but it is still growing at a healthy rate.

So, what is all this about? I have some ideas of my own here, really (surprise!) and that what the rest of this artile is about. On one hand, if you move from on-prem to the cloud and think that this will be just like having all your computer cabinets, netwok switches etc. just like before, but have someone else managing the hardware, then you are likely going to fail and will not benefit from the #1 reason you likely went to the cloud to begin with: saving cost! Really, to benefit from the cloud you want to take advantage of how the cloud works, using the services provided by the cloud vendors, create and remove instances and other reasources as they are needed and prepare your application for an infrastructure that is moving and changing at times. If all you want is servers running linux on fixed IPs and have system that assume that all services are always available and are always, 24x7, in the same place and with the same attributes, maybe the cloud isn't for you.

This is not what happened to most of services that went back on-prem after having been on the cloud for a considerable time. The folks behind these services likely know exactly what they do and how their system work. They have systems that can sustain that services are gone or have moved, they don't assume that everything is always in the same place. They don't assume that scaling necessarily mean getting a bigger box, it can mean getting more boxes or something else. Often they have an application infrastructure largely running on containers (in some cases these smart guys and gals even put their database servers on some more firm ground (VMs) and ignored containers only for these).

In the latter cases we are talking about people who know stuff really well and actually do not mind managing IT infrastructure on their own. In this cases, there are again limited advantages to the cloud as they know how to do everything that a cloud provider can offer themselves and, just as important, they have the money and resources to do so.

Inbetween these two extremes, on one end those who has services, organization and / or a state of mind that is just to inflexible to benefit from the cloud and on the other end the super smart and resourceful geeks that can outsmart even Amazon, there is still a bunch of solid applications, companies and infrastructure that can benefit from the cloud. Maybe they are just a bit more flexible than the first group I mentioned and are willing to take the plunge and modernize what they have and to cloudify themselves, or maybe it is a company where IT is run by modern day Albert Einstein types and can outsmart anyone on the planet but they just cannot be bothered to mock around with IT hardware, OS patching and grand scale security implementations (if you ask me, this is the #1 reason to cloudify: having security looked after by someone else, and I am not claiming it gets more secure I just claim that I cannot be bothered with the mess of CA, SSL and Certificates and all that. What is a Certificate even?).

/Karlsson
Trying hard not to get a job that involves IT security

Friday, March 3, 2017

CONNECT BY is dead, long live CTE! In MariaDB 10.2!

Yes, you got that right, the old CONNECT BY as used by recursive SQL with Oracle has been replaced by Common Table Expressions, or the WITH statement in SQL:1999 which is now also available in MariaDB 10.2. Now, the SQL WITH construct, using Common Table Expressions or CTE, is useful for other things than just recursive queries, but this is the one feature that WITH enables that was previously very hard to do without some procedural code, the non-recursive use of Common Table Expressions could previously mostly be replaced by using temporary tables.

This blog post will explain what recursive SQL is all about and why this is useful, and I will show some examples of both CONNECT BY and how the same SQL is written using the WITH clause.

The most common example for recursive SQL is probably for doing a parts explosion, where we have a table of parts of some component where each part is either a main, top level, part or is a part of another part. For example a car with an engine, where the engine consists of pistons, cylinders and a camshaft, where the latter also includes some camshaft bearings. I think you get the basic idea here. To query this data to create a list of components that make up some other component, you need to recursively visit the data, i.e.. each row is evaluated using conditions from any other row already fetched, except the first row fetched that is.

Now, let's look at some data first. I assume we have two tables here, one table that contains information on the different parts and then one table that contains information on the individual parts and then one table that contains the hierarchy of the parts, called components. Like this:
CREATE TABLE parts(part_id INTEGER NOT NULL PRIMARY KEY,
part_name VARCHAR(60) NOT NULL);

CREATE TABLE components(comp_id INTEGER NOT NULL PRIMARY KEY,
comp_name VARCHAR(60),
comp_count INTEGER NOT NULL,
comp_part INTEGER NOT NULL,
comp_partof INTEGER,
FOREIGN KEY(comp_part) REFERENCES parts(part_id));
ALTER TABLE components ADD FOREIGN KEY(comp_partof) REFERENCES components(comp_id);

The two things to note here is that the components table has a column, comp_partof, that implements the hierarchy and that there is a self-referencing FOREIGN KEY constraint on this table. Given these tables, assuming that we are a small privately held car-manufacturing company in southern Germany, let's insert some data:
INSERT INTO parts VALUES(1, 'Car');
INSERT INTO parts VALUES(2, 'Bolt');
INSERT INTO parts VALUES(3, 'Nut');
INSERT INTO parts VALUES(4, 'V8 engine');
INSERT INTO parts VALUES(5, '6-cylinder engine');
INSERT INTO parts VALUES(6, '4-cylinder engine');
INSERT INTO parts VALUES(7, 'Cylinder block');
INSERT INTO parts VALUES(8, 'Cylinder');
INSERT INTO parts VALUES(9, 'Piston');
INSERT INTO parts VALUES(10, 'Camshaft');
INSERT INTO parts VALUES(11, 'Camshaft bearings');
INSERT INTO parts VALUES(12, 'Body');
INSERT INTO parts VALUES(13, 'Gearbox');
INSERT INTO parts VALUES(14, 'Chassie');
INSERT INTO parts VALUES(15, 'Rear axle');
INSERT INTO parts VALUES(16, 'Rear break');
INSERT INTO parts VALUES(17, 'Wheel');
INSERT INTO parts VALUES(18, 'Wheel bolts');

INSERT INTO components VALUES(1, '320', 1, 1, NULL);
INSERT INTO components VALUES(2, NULL, 1, 6, 1);
INSERT INTO components VALUES(3, NULL, 1, 7, 2);
INSERT INTO components VALUES(4, NULL, 4, 8, 3);
INSERT INTO components VALUES(5, NULL, 4, 9, 3);
INSERT INTO components VALUES(6, NULL, 1, 10, 3);
INSERT INTO components VALUES(7, NULL, 3, 11, 6);
INSERT INTO components VALUES(8, NULL, 1, 12, 1);
INSERT INTO components VALUES(9, NULL, 1, 14, 1);
INSERT INTO components VALUES(10, NULL, 1, 15, 9);
INSERT INTO components VALUES(11, NULL, 2, 16, 10);

INSERT INTO components VALUES(12, '323 i', 1, 1, NULL);
INSERT INTO components VALUES(13, NULL, 1, 5, 12);
If you are not into mechanics, let me tell you that there are more parts than this to a car, for example I left out a few critical components, such as the cupholder, the dog that stands on the pickup cargo area and the insulting bumber-sticker, but I think you get the idea. Note that there are two "main" components, the '320' and '323 i' and that these are top level components are indicated by the comp_partof column being set to NULL.

Now, assume you want to list all the parts that make up a 320. The way this works when using the CONNECT BY syntax, you compose one single SQL statement and provide a CONNECT BY clause to indicate the relationship. Like this:
SELECT LPAD('-', level, '-')||'>' level_text, comp_count, NVL(comp_name, part_name) name
FROM components c, parts p
WHERE c.comp_part = p.part_id
START WITH c.comp_name = '320'
CONNECT BY PRIOR c.comp_id = c.comp_partof;

Let me explain this a bit, but there is nothing really magic here. We are selecting from the two tables and joining them just as usual. Then we use the START WITH clause to define the top level component and then the rest of the components are have a comp_partof of a component that matches the comp_id of the START WITH component or a comp_id of any other component that has been fetched.
This way of writing recursive SQL has some advantages, such as it is relatively compact and is easy to understand. The disadvantage is that there are some quirks and limitation to this and that once your queries gets more complex, CONNECT BY gets a bit hairy. One sure sign that CONNECT BY is going away, even though I and many others tend to like it because of the ease of use, is that even Oracle, as of Oracle 11g, also has implemented the WITH construct, or Common Table Expressions or CTE. So looking at the above statement how this would work in MariaDB 10.2, this is what it would look like using the WITH construct:
WITH RECURSIVE comp(comp_id, comp_name, comp_partof, comp_count) AS (
SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
UNION ALL
SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp;

Comparing this CTE version to the CONNECT BY version as above, this is a bit more complex, but how it works is actually pretty clear once you look at it carefully. To begin with, the top level item or anchor is the first SELECT in the UNION ALL and the following components are fetched using the second SELECT. Then the recursive aspect is handled by this UNION being run until there are no more rows returned from it? As you can see, although this requires more text and more complex SQL to write, it is also a fair bit more flexible. For example, the anchor point is defined by a completely separate SELECT which means it can be whatever SELECT you want, selecting from any odd table. Secondly, the column you use and the conditions for defining the hierarchy can be as complex as you want. And thirdly, there is also the power of that last SELECT which in the case above just gets the data from the UNION, but you can actually apply any kind of filter, ordering or column filter to this query. The result of the query above is this:
comp_count      comp_name
1               320
1               4-cylinder engine
1               Body
1               Chassie
1               Cylinder block
1               Rear axle
4               Cylinder
4               Piston
1               Camshaft
2               Rear break
3               Camshaft bearings

Before I finish this off, the WITH RECURSIVE statement is somewhat overly complex, in MariaDB 10.2 you can for example skip listing the column names of the recursive table, like this:
WITH RECURSIVE comp AS (
SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
UNION ALL
SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp;

And although Oracle 11 and up supports the CTEs, it works a bit differently. For one thing, the RECURSIVE keyword isn't supported (it is assumed to be recursive by default) and the way I read the SQL standard, this is actually wrong, for recursive queries you have to use the RECURSIVE keyword. Second, Oracle does require the SELECT-list. So in Oracle, you would see something like this:
WITH comp(comp_id, comp_name, comp_partof, comp_count) AS (
SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
UNION ALL
SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp;
Yes, we are all happily following the same SQL standard. Somewhat...
See the MariaDB Knowledge Base for more information on common table expressions.

Happy SQL'ing
/Karlsson

Tuesday, February 28, 2017

JSON with MariaDB 10.2

JSON is fast becoming the standard format for data interchange and for unstructured data, and MariaDB 10.2 adds a range on JSON supporting functions, even though a JSON datatype isn't implemented yet. There are some reasons why there isn't a JSON datatype, but one is that there are actually not that many advantages to that as JSON is a text-based format. This blog post aims to describe JSON and the use cases for it, as well as to describe the MariaDB 10.2 JSON functions and uses for these, as well as showing some other additions to MariaDB 10.2 that are useful for JSON processing.

So to begin with then, why do we need JSON? Or to put it differently, why do we not store all data in JSON? Well, the reason as I see it is that some data we work with really is best treated as schemaless whereas some other data really should be handled in a more strict way in a schema. Which means that in my mind mixing relational data with unstructured data is what we really want. And using JSON for unstructured data is rather neat, and JSON is even standardized (see json.org).

There are reasons why this hasn't always been so. When the sad old git that is writing this stuff started working in this industry, which I think was during the Harding administration, computers were rare, expensive, handled only by experts (so how I got to work with them is a mystery) and built from lego-bricks, meccano and pieces of solid gold (to keep up the price). Also, they were as powerful as a slide-ruler, except it was feed with punched-cards (and probably powered by steam). Anyway, no one in their right mind would have considered string pictures of cute felines as something to be on a computer, or actually stuff to be stored in the database. The little that would fit was the really important stuff - like price, amount in stock, customer name, billing address and such - and nothing else. And not only that, stuff that was stored had some kind of value, somehow, which meant it had to follow certain rules (and following rules is something I am good at? I wonder how I ended up in this business. Again). Like, a price had to be a number of some kind, with a value 0 or higher and some other restrictions. As you see, these were hard and relentless times.

And then time moved on and people started buying things on the internet (whatever the internet is. I think it is some kind of glorified, stylish version of punched cards) and stuff such as Facebook and Google came around. The issue with computer storage was now not how to fit all that nicely structured data in it, but rather once we have filled that hard drive on your desktop with all the product, customers and transactions from Amazon (I think Amazon has something to do with Internet, but I am not sure) and a full 17.85% of that drive is now occupied by that old-style structured data, what more do we put in there? Maybe we could put some more information on the items for sale in that database, and some general information on who is buying it? That should fill up that disk nicely, right? Well, yes, but that new data, although related to the structured data I already have, is largely unstructured. Say, for example, you write a review of a product on Amazon late in the morning after a good deal of heavy "partying" (which is not an Internet thing, I think), the contents of that would hardly be considered "structured". If you didn't like the product (which you probably didn't), then the appropriate terms for large parts of that review would probably be "profanity" or "foul language").

The way to deal with the above is a mix of structured and unstructured data, with some kind of relation between the two. Like a column of unstructured data in each relational database table (or should I say "relation", just to show my age? Or maybe I should pretend to be really young, modern and cool, possibly sporting a hipster beard and all that, by calling it a "collection").

With that out of the way, let's consider an example using structured as well as non-structured JSON data. Assume we have a store selling different types of clothing, pants, jackets, shoes and we are to create a table to hold the inventory. This table would have some columns that are always there and which have the same meaning for all rows in the table, like name, amount in stock and price. These are items that are well suited for a relational format. On top of this we want to add attributes that have different meaning for each type of or even each instance of items. Here we have things like colour, width, length and size. These we consider non-relational as the interpretation of these attributes are different depending of the type of garment (like size M or shoe sizes or a "zebra striped" colour) and some garments might have some unique attribute, like designer or recommended by staff or something. Our table might then look something like this:
MariaDB> CREATE TABLE products(id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
price DECIMAL(9,2) NOT NULL,
stock INTEGER NOT NULL,
attr VARCHAR(1024));
In this table we have a few columns that look like columns in any relational database table, and then we have a column, called attr, that can hold any relevant attribute for the garment in question and we will store that as JSON a JSON string. You probably notice that we aren't using a JSON datatype here as that is not present in MariaDB, despite that there are JSON functions, but those JSON functions act on a text-string with JSON content. These functions are introduced in MariaDB 10.2 (which is in Beta as I write this), but there are a few bugs that means you should use MariaDB 10.2.4 or higher, which means as for now we assume that MariaDB 10.2.4 or higher is being used.

But there is one issue with the above that I don't particularly care for and that is, as the attr column is plain text, any kind of data can be put in the attr column, even non-valid JSON. The good thing is that there is a fix for this in MariaDB 10.2, which is CHECK constraints that actually work, and this is a little discussed feature of MariaDB 10.2. The way this works is that this kind of constraint kicks in whenever a row is INSERTed or UPDATEed, any CHECK constraint runs and validates the data and if the validation fails the operation also fails. Before I show an example I just want to mention one JSON function we are to use here, which is JSON_VALID which takes a string and checks if it is valid JSON. Note that although CHECK constraints are particularly valid here, check constraints can be used for any kind of data validation.

Armed with this, let's rewrite the statement that creates the table like this:
MariaDB> CREATE TABLE products(id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
price DECIMAL(9,2) NOT NULL,
stock INTEGER NOT NULL,
attr VARCHAR(1024),
CHECK (JSON_VALID(attr)));

Let's give this a try now:
MariaDB> INSERT INTO products VALUES(NULL, 'Jeans', 10.5, 165, NULL);
ERROR 4025 (23000): CONSTRAINT `CONSTRAINT_1` failed for `inventory`.`products`
Ok, that didn't work out. What happens here is that a NULL string isn't a valid JSON value, so we need to rewrite our table definition:
MariaDB> CREATE TABLE products(id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
price DECIMAL(9,2) NOT NULL,
stock INTEGER NOT NULL,
attr VARCHAR(1024),
CHECK (attr IS NULL OR JSON_VALID(attr)));
Following this we can try it again:
MariaDB> INSERT INTO products VALUES(NULL, 'Jeans', 10.5, 165, NULL);
Query OK, 1 row affected (0.01 sec)
MariaDB> INSERT INTO products VALUES(NULL, 'Shirt', 10.5, 78, '{"size": 42, "colour": "white"}');
Query OK, 1 row affected (0.01 sec)
MariaDB> INSERT INTO products VALUES(NULL, 'Blouse', 17, 15, '{"colour": "white}');
ERROR 4025 (23000): CONSTRAINT `CONSTRAINT_1` failed for `inventory`.`products`
That last statement failed because of malformed JSON (a double quote was forgotten about), so let's correct that:
MariaDB> INSERT INTO products VALUES(NULL, 'Blouse', 17, 15, '{"colour": "white"}');
Query OK, 1 row affected (0.01 sec)
One thing that has yet to be discussed is indexes on JSON values. As the attr column in our example is a plain text, we can of course index it as usual, but that is probably not what you want to do, rather what would be neat would be to create an index on individual attributes in that JSON string. MariaDB doesn't yet support functional indexes, i.e. functions not on values but on computed values. What MariaDB does have though is Virtual Columns, and these can be indexed and as of MariaDB 10.2 these virtual columns don't have to be persistent, (read more on Virtual Columns here: Putting Virtual Columns to good use).

The easiest way to explain this is with an example. Let's say we want an index on the colour attribute, if such a thing exists. For this we need two things: A virtual column that contains the colour attribute as extracted from the attr column, and then an index on that. In this case we will be using the JSON_VALUE function that takes a JSON value and a path, the latter describing the JSON operation to be performed, somewhat like a query language for JSON.

We end up with something like this:
MariaDB> ALTER TABLE products ADD attr_colour VARCHAR(32) AS (JSON_VALUE(attr, '$.colour'));
MariaDB> CREATE INDEX products_attr_colour_ix ON products(attr_colour);
With that in place, let's see how that works:
MariaDB> SELECT * FROM products WHERE attr_colour = 'white';
+----+--------+-------+-------+---------------------------------+-------------+
| id | name   | price | stock | attr                            | attr_colour |
+----+--------+-------+-------+---------------------------------+-------------+
| 2 | Shirt | 10.50 |    78 | {"size": 42, "colour": "white"} | white       |
| 3 | Blouse | 17.00 |    15 | {"colour": "white"}             | white       |
+----+--------+-------+-------+---------------------------------+-------------+
2 rows in set (0.00 sec)
And let's see if that index is working as it should:
MariaDB> EXPLAIN SELECT * FROM products WHERE attr_colour = 'white';
+------+-------------+----------+------+-------------------------+-------------------------+---------+-------+------+------------+
| id   | select_type | table    | type | possible_keys           | key                     | key_len | ref   | rows | Etra       |
+------+-------------+----------+------+-------------------------+-------------------------+---------+-------+------+------------+
|    1 | SIMPLE      | products | ref | products_attr_colour_ix | products_attr_colour_ix | 99      | const |    2 | Uing where |
+------+-------------+----------+------+-------------------------+-------------------------+---------+-------+------+------------+
1 row in set (0.00 sec)

And just to show that the column attr_colour is a computed column that depends on the attr column, lets try updating the colour for the blouse and make that red instead of white and then search that. To replace a value in a JSON object MariaDB 10.2 provides the JSON_REPLACE functions (for all JSON functions in MariaDB 10.2 see MariaDB Knowledge Base).
MariaDB> UPDATE products SET attr = JSON_REPLACE(attr, '$.colour', 'red') WHERE name = 'Blouse';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB> SELECT attr_colour FROM products WHERE name = 'blouse';
+-------------+
| attr_colour |
+-------------+
| red         |
+-------------+
1 row in set (0.00 sec)

There is more to say about JSON in MariaDB 10.2 but I hope you now have a feel on what's for offer.

Happy SQL'ing
/Karlsson