Skip to main content

· 2 min read

Requirements

The VPC must be located in one of our ClickPipes regions: us-east-1, us-east-2 or eu-central-1. (https://clickhouse.com/docs/en/integrations/clickpipes#list-of-static-ips)

Follow these steps to create a VPC endpoint service for your RDS instance. Repeat these steps if you have multiple instances that require endpoint services:

  1. Locate Your VPC and Create an NLB

    • Navigate to your target VPC and create a Network Load Balancer (NLB).
  2. Configure the Target Group

    • The target group should point to the RDS instance's endpoint IP and Port (typically 5432 for PostgreSQL or 3306 for MySQL).
    • Ensure that the TCP protocol is used to avoid TLS termination by the NLB.
    • IMPORTANT: Make sure the RDS instance endpoint used in case of DB Cluster/Aurora is ONLY the WRITER Endpoint and NOT the common endpoint.
  3. Set the Listener Port

    • The listener port of the load balancer must match the port used by the target group (typically 5432 for PostgreSQL or 3306 for MySQL).
  4. Ensure the Load Balancer is Private

    • Configure the NLB to be private, ensuring it is only accessible within the VPC.
  5. Create the VPC Endpoint Service

    • In the VPC, create an endpoint service that points to the NLB.
    • Enable acceptance of connection requests from specific accounts.
  6. Authorize ClickPipes to Use the Endpoint Service

    • Grant permission to the ClickPipes account to request this endpoint service.
    • Configure allowed principals by adding the following principal ID:
      arn:aws:iam::072088201116:root

Initiating connection

When it's done, share details such as private DNS name, VPC service name and availability zone. ClickPipes team will initiate VPC endpoints creation in ClickPipes VPC. This will require connection request acceptance on your side.

Creating ClickPipes

Use your RDS's private DNS endpoints to create your ClickPipes.

· 2 min read

This is a step by step example on how to start using Python with ClickHouse Cloud service.

Note

Keep in mind that Python versions and libraries dependencies are constantly evolving. Make also sure to use the latest supported versions of both the driver and Python environment when trying this.

At the time of writing this article, we're using the clickhouse-connect driver version 0.5.23 and python 3.11.2 respectively.

Steps

  1. Check the Python version:
$  python -V
Python 3.11.2
  1. We'll assemble the project in a folder called ch-python:
$ mkdir ch-python
$ cd ch-python
  1. Create a dependencies file named requirements.txt with:
clickhouse-connect==0.5.23
  1. Create a python source file named main.py:
import clickhouse_connect
import sys
import json

CLICKHOUSE_CLOUD_HOSTNAME = 'HOSTNAME.clickhouse.cloud'
CLICKHOUSE_CLOUD_USER = 'default'
CLICKHOUSE_CLOUD_PASSWORD = 'YOUR_SECRET_PASSWORD'

client = clickhouse_connect.get_client(
host=CLICKHOUSE_CLOUD_HOSTNAME, port=8443, username=CLICKHOUSE_CLOUD_USER, password=CLICKHOUSE_CLOUD_PASSWORD)

print("connected to " + CLICKHOUSE_CLOUD_HOSTNAME + "\n")
client.command(
'CREATE TABLE IF NOT EXISTS new_table (key UInt32, value String, metric Float64) ENGINE MergeTree ORDER BY key')

print("table new_table created or exists already!\n")

row1 = [1000, 'String Value 1000', 5.233]
row2 = [2000, 'String Value 2000', -107.04]
data = [row1, row2]
client.insert('new_table', data, column_names=['key', 'value', 'metric'])

print("written 2 rows to table new_table\n")

QUERY = "SELECT max(key), avg(metric) FROM new_table"

result = client.query(QUERY)

sys.stdout.write("query: ["+QUERY + "] returns:\n\n")
print(result.result_rows)
  1. Create the virtual environment:
chpython$ python -m venv venv
  1. Load the virtual environment:
chpython$ source venv/bin/activate

Once loaded, your terminal prompt should be prefixed with (venv), install dependencies:

(venv) ➜  chpython$ pip install -r requirements.txt
Collecting certifi
Using cached certifi-2023.5.7-py3-none-any.whl (156 kB)
Collecting urllib3>=1.26
Using cached urllib3-2.0.2-py3-none-any.whl (123 kB)
Collecting pytz
Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting zstandard
Using cached zstandard-0.21.0-cp311-cp311-macosx_11_0_arm64.whl (364 kB)
Collecting lz4
Using cached lz4-4.3.2-cp311-cp311-macosx_11_0_arm64.whl (212 kB)
Installing collected packages: pytz, zstandard, urllib3, lz4, certifi, clickhouse-connect
Successfully installed certifi-2023.5.7 clickhouse-connect-0.5.23 lz4-4.3.2 pytz-2023.3 urllib3-2.0.2 zstandard-0.21.0
  1. Launch the code!
(venv) chpython$ venv/bin/python main.py

connected to HOSTNAME.clickhouse.cloud

table new_table created or exists already!

written 2 rows to table new_table

query: [SELECT max(key), avg(metric) FROM new_table] returns:

[(2000, -50.9035)]
Tip

If using an older Python version (e.g. 3.9.6) you might be getting an ImportError related to urllib3 library. In that case either upgrade your Python environment to a newer version or pin the urllib3 version to 1.26.15 in your requirements.txt file.

· 5 min read

How can I use API to manage clusters on ClickHouse Cloud?

Answer

We will use Terraform to configure our infra and ClickHouse Provider

Steps:

1). Create an API Key on Cloud. Follow the docs here - https://clickhouse.com/docs/en/cloud/manage/openapi

Save the creds locally.

2). Install Terraform using - https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli

You can use Homebrew package manager if you're on Mac.

3). Create a directory anywhere you like:

mkdir test
➜ test pwd
/Users/jaijhala/Desktop/terraform/test

4). Create 2 files: main.tf and secret.tfvars

Copy the following:

main.tf file would be:

terraform {
required_providers {
clickhouse = {
source = "ClickHouse/clickhouse"
version = "0.0.2"
}
}
}

variable "organization_id" {
type = string
}

variable "token_key" {
type = string
}

variable "token_secret" {
type = string
}

provider clickhouse {
environment = "production"
organization_id = var.organization_id
token_key = var.token_key
token_secret = var.token_secret
}


variable "service_password" {
type = string
sensitive = true
}

resource "clickhouse_service" "service123" {
name = "jai-terraform"
cloud_provider = "aws"
region = "us-east-2"
tier = "development"
idle_scaling = true
password = var.service_password
ip_access = [
{
source = "0.0.0.0/0"
description = "Anywhere"
}
]
}

output "CLICKHOUSE_HOST" {
value = clickhouse_service.service123.endpoints.0.host
}

You can replace your own parameters like service name, region etc.. in the resources section above.

secret.tfvars is where you'll put all the API Key related info that you downloaded earlier. The idea behind this file is that all your secret credentials will be hidden from the main config file.

It would be something like (replace these parameters):

organization_id = "e957a5f7-4qe3-4b05-ad5a-d02b2dcd0593"
token_key = "QWhhkMeytqQruTeKg"
token_secret = "4b1dNmjWdLUno9lXxmKvSUcPP62jvn7irkuZPbY"
service_password = "password123!"

5). Run terraform init from this directory

Expected output:

Initializing the backend...

Initializing provider plugins...
- Finding clickhouse/clickhouse versions matching "0.0.2"...
- Installing clickhouse/clickhouse v0.0.2...
- Installed clickhouse/clickhouse v0.0.2 (self-signed, key ID D7089EE5C6A92ED1)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

6). Run terraform apply -var-file=secret.tfvars command.

Something like:

➜  test terraform apply -var-file=secret.tfvars

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
+ create

Terraform will perform the following actions:

# clickhouse_service.service123 will be created
+ resource "clickhouse_service" "service123" {
+ cloud_provider = "aws"
+ endpoints = (known after apply)
+ id = (known after apply)
+ idle_scaling = true
+ ip_access = [
+ {
+ description = "Anywhere"
+ source = "0.0.0.0/0"
},
]
+ last_updated = (known after apply)
+ name = "jai-terraform"
+ password = (sensitive value)
+ region = "us-east-2"
+ tier = "development"
}

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
+ CLICKHOUSE_HOST = (known after apply)

Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.

Enter a value: yes

Type yes and hit enter

Side note: Notice it says password = (sensitive value) above. This is because we set sensitive = true for the password in the main.tf file.

7). It will take a couple of mins to create the service but eventually it should come up like:

  Enter a value: yes

clickhouse_service.service123: Creating...
clickhouse_service.service123: Still creating... [10s elapsed]
clickhouse_service.service123: Still creating... [20s elapsed]
clickhouse_service.service123: Still creating... [30s elapsed]
clickhouse_service.service123: Still creating... [40s elapsed]
clickhouse_service.service123: Still creating... [50s elapsed]
clickhouse_service.service123: Still creating... [1m0s elapsed]
clickhouse_service.service123: Still creating... [1m10s elapsed]
clickhouse_service.service123: Still creating... [1m20s elapsed]
clickhouse_service.service123: Still creating... [1m30s elapsed]
clickhouse_service.service123: Still creating... [1m40s elapsed]
clickhouse_service.service123: Creation complete after 1m41s [id=aa8d8d63-1878-4600-8470-630715af38ed]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

CLICKHOUSE_HOST = "h3ljlaqez6.us-east-2.aws.clickhouse.cloud"
➜ test

8). Check Cloud Console, you should be able to see the service created.

9). To clean up/destroy the service again, run terraform destroy -var-file=secret.tfvars

Something like:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
- destroy

Terraform will perform the following actions:

# clickhouse_service.service123 will be destroyed
- resource "clickhouse_service" "service123" {
- cloud_provider = "aws" -> null
- ............

Plan: 0 to add, 0 to change, 1 to destroy.

Changes to Outputs:
- CLICKHOUSE_HOST = "h3ljlaqez6.us-east-2.aws.clickhouse.cloud" -> null

Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.

Enter a value:

Type yes and hit enter

10).

clickhouse_service.service123: Destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed]
clickhouse_service.service123: Still destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed, 10s elapsed]
clickhouse_service.service123: Still destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed, 20s elapsed]
clickhouse_service.service123: Destruction complete after 27s

Destroy complete! Resources: 1 destroyed.

And it should be gone from the Cloud Console.

More details about the Cloud API can be found here - https://clickhouse.com/docs/en/cloud/manage/api/api-overview

· 4 min read

In this guide, we'll learn how to add a column to an existing table. We'll be using clickhouse-local:

clickhouse -m --output_format_pretty_row_numbers=

Let's imagine we have the following table:

CREATE TABLE events (
date Date DEFAULT today(),
name String
)
ENGINE = MergeTree
ORDER BY date;

Let's add one record:

INSERT INTO events (name) VALUES ('Alexey');

And now query the events table:

SELECT *
FROM events;
┌───────date─┬─name───┐
│ 2024-12-18 │ Alexey │
└────────────┴────────┘

Adding a new column

Now let's say we're going to add a new column called favoriteNumber, which will be a Float64. We can do this using the ALTER TABLE...ADD COLUMN clause:

ALTER TABLE events 
ADD COLUMN favoriteNumber Float64 DEFAULT 7;

If we query the events table, we'll see the following output:

┌───────date─┬─name───┬─favoriteNumber─┐
│ 2024-12-18 │ Alexey │ 7 │
└────────────┴────────┴────────────────┘

The Alexey row defaults to 7 since that column didn't exist when we added that row. Next, let's add another column:

INSERT INTO events (name) VALUES ('Tyler');

If we query the events table, we'll see the following output:

┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ date ┃ name ┃ favoriteNumber ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ 2024-12-18 │ Tyler │ 7 │
├────────────┼────────┼────────────────┤
│ 2024-12-18 │ Alexey │ 7 │
└────────────┴────────┴────────────────┘

Modifying a column's default value

If we modify the favoriteNumber column to have a different type using the ALTER TABLE...MODIFY COLUMN clause, things get interesting:

ALTER TABLE events 
MODIFY COLUMN favoriteNumber Float64 DEFAULT 99;

If we query events again, we'll see this output:

┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ date ┃ name ┃ favoriteNumber ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ 2024-12-18 │ Tyler │ 7 │
├────────────┼────────┼────────────────┤
│ 2024-12-18 │ Alexey │ 99 │
└────────────┴────────┴────────────────┘

Tyler keeps a value of 7, which was the default when that row was created. Alexey picks up the new default of 99 because the favoriteNumber column didn't exist when that row was created.

If we want the Alexey row to use the current default right away, we need to call OPTIMIZE TABLE to force current defaults to be written to disk:

OPTIMIZE TABLE events;

Once we've done that, let's say we change the default value again:

ALTER TABLE events 
MODIFY COLUMN favoriteNumber Float64 DEFAULT 21;

And then insert another row:

INSERT INTO events (name) VALUES ('Tanya');

Finally, let's query events one more time:

┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ date ┃ name ┃ favoriteNumber ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ 2024-12-18 │ Alexey │ 99 │
├────────────┼────────┼────────────────┤
│ 2024-12-18 │ Tyler │ 7 │
├────────────┼────────┼────────────────┤
│ 2024-12-18 │ Tanya │ 21 │
└────────────┴────────┴────────────────┘

Tanya picks up the new default of 21, but Alexey has the old default of 99.

Controlling column position in table

When we add a new column, by default it will be added at the end of the table. But, we can use the FIRST and AFTER clauses to control where a column is positioned.

For example, if we wanted to add a column called favoriteColor after the name column, we could do this:

ALTER TABLE events
ADD COLUMN favoriteColor String DEFAULT 'Yellow' AFTER name;

Let's query events:

┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ date ┃ name ┃ favoriteColor ┃ favoriteNumber ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ 2024-12-18 │ Alexey │ Yellow │ 99 │
├────────────┼────────┼───────────────┼────────────────┤
│ 2024-12-18 │ Tyler │ Yellow │ 7 │
├────────────┼────────┼───────────────┼────────────────┤
│ 2024-12-18 │ Tanya │ Yellow │ 21 │
└────────────┴────────┴───────────────┴────────────────┘

And if we wanted to add a column favoriteDatabase and have that be first in the list, we could do this:

ALTER TABLE events
ADD COLUMN favoriteDatabase String DEFAULT 'ClickHouse' FIRST;
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ favoriteDatabase ┃ date ┃ name ┃ favoriteColor ┃ favoriteNumber ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ ClickHouse │ 2024-12-18 │ Tanya │ Yellow │ 21 │
├──────────────────┼────────────┼────────┼───────────────┼────────────────┤
│ ClickHouse │ 2024-12-18 │ Alexey │ Yellow │ 99 │
├──────────────────┼────────────┼────────┼───────────────┼────────────────┤
│ ClickHouse │ 2024-12-18 │ Tyler │ Yellow │ 7 │
└──────────────────┴────────────┴────────┴───────────────┴────────────────┘

And let's have a look at the table definition:

SHOW CREATE TABLE events
FORMAT LineAsString
CREATE TABLE default.`clickhouse-local-ab404c86-56cc-495b-ad1d-fb343cac3bc0events`
(
`favoriteDatabase` String DEFAULT 'ClickHouse',
`date` Date DEFAULT today(),
`name` String,
`favoriteColor` String DEFAULT 'Yellow',
`favoriteNumber` Float64 DEFAULT 21
)
ENGINE = MergeTree
ORDER BY date
SETTINGS index_granularity = 8192

· 5 min read

Question

How do I import GeoJSON with a nested object array?

Answer

For this tutorial, we will use open data publicly available here. A copy can be found here.

  1. Download the data in GeoJSON format and rename the file to geojson.json.
  2. Understand the structure.
DESCRIBE TABLE file('geojson.json', 'JSON')
┌─name─────┬─type─────────────────────────────────────────────────────────────────────────────────────────┐
type │ Nullable(String)
│ name │ Nullable(String)
│ crs │ Tuple( properties Tuple(name Nullable(String)),type Nullable(String))
│ features │ Array(Tuple(
│ │ geometry Tuple(coordinates Array(Array(Array(Array(Nullable(Float64))))),
│ │ type Nullable(String)),
│ │ properties Tuple( CODIGOINE Nullable(String),
│ │ CODNUT1 Nullable(String),
│ │ CODNUT2 Nullable(String),
│ │ CODNUT3 Nullable(String),
│ │ FID Nullable(Int64),
│ │ INSPIREID Nullable(String),
│ │ NAMEUNIT Nullable(String),
│ │ NATCODE Nullable(String),
│ │ SHAPE_Area Nullable(Float64),
│ │ SHAPE_Length Nullable(Float64)
│ │ ),
│ │ type Nullable(String)
│ │ )
│ │ )
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────┘
  1. Create a table to store the GeoJSON rows.

The requirement here is to generate a row for each object in the features array. The data type inferred for the field geometry suggests that it translates to ClickHouse's MultiPolygon data type.

create table geojson 
(
type String,
name String,
crsType String,
crsName String,
featureType String,
id Int64,
inspiredId String,
natCode String,
nameUnit String,
codNut1 String,
codNut2 String,
codNut3 String,
codigoIne String,
shapeLength Float64,
shapeArea Float64,
geometryType String,
geometry MultiPolygon
)
engine = MergeTree
order by id;
  1. Prepare the data. The main purpose of the query is to verify that we obtain one row for each object in the features array.

The field features.geometry.coordinates is commented to make the result set more readable.

SELECT
type AS type,
name AS name,
crs.type AS crsType,
crs.properties.name AS crsName,
features.type AS featureType,
features.properties.FID AS id,
features.properties.INSPIREID AS inspiredId,
features.properties.NATCODE AS natCode,
features.properties.NAMEUNIT AS nameUnit,
features.properties.CODNUT1 AS codNut1,
features.properties.CODNUT2 AS codNut2,
features.properties.CODNUT3 AS codNut3,
features.properties.CODIGOINE AS codigoIne,
features.properties.SHAPE_Length AS shapeLength,
features.properties.SHAPE_Area AS shapeArea,
features.geometry.type AS geometryType
--,features.geometry.coordinates
FROM file('municipios_ign.geojson', 'JSON')
ARRAY JOIN features
LIMIT 5

┌─type──────────────┬─name───────────┬─crsType─┬─crsName───────────────────────┬─featureType─┬─id─┬─inspiredId───────────────┬─natCode─────┬─nameUnit──────────────┬─codNut1─┬─codNut2─┬─codNut3─┬─codigoIne─┬────────shapeLength─┬─────────────shapeArea─┬─geometryType─┐
│ FeatureCollection │ Municipios_IGN │ name │ urn:ogc:def:crs:OGC:1.3:CRS84 │ Feature │ 1 │ ES.IGN.SIGLIM34081616266 │ 34081616266 │ Villarejo-Periesteban │ ES4 │ ES42 │ ES423 │ 162660.26974769973041210.0035198414406406673MultiPolygon
│ FeatureCollection │ Municipios_IGN │ name │ urn:ogc:def:crs:OGC:1.3:CRS84 │ Feature │ 2 │ ES.IGN.SIGLIM34081616269 │ 34081616269 │ Villares del Saz │ ES4 │ ES42 │ ES423 │ 162690.44760839012699050.00738179315030249MultiPolygon
│ FeatureCollection │ Municipios_IGN │ name │ urn:ogc:def:crs:OGC:1.3:CRS84 │ Feature │ 3 │ ES.IGN.SIGLIM34081616270 │ 34081616270 │ Villarrubio │ ES4 │ ES42 │ ES423 │ 162700.30539422739941790.0029777582813496337MultiPolygon
│ FeatureCollection │ Municipios_IGN │ name │ urn:ogc:def:crs:OGC:1.3:CRS84 │ Feature │ 4 │ ES.IGN.SIGLIM34081616271 │ 34081616271 │ Villarta │ ES4 │ ES42 │ ES423 │ 162710.28312269798211840.002680273189024594MultiPolygon
│ FeatureCollection │ Municipios_IGN │ name │ urn:ogc:def:crs:OGC:1.3:CRS84 │ Feature │ 5 │ ES.IGN.SIGLIM34081616272 │ 34081616272 │ Villas de la Ventosa │ ES4 │ ES42 │ ES423 │ 162720.59582767492467770.015354885085133583MultiPolygon
└───────────────────┴────────────────┴─────────┴───────────────────────────────┴─────────────┴────┴──────────────────────────┴─────────────┴───────────────────────┴─────────┴─────────┴─────────┴───────────┴────────────────────┴───────────────────────┴──────────────┘
  1. Insert the data.
INSERT INTO geojson
SELECT
type AS type,
name AS name,
crs.type AS crsType,
crs.properties.name AS crsName,
features.type AS featureType,
features.properties.FID AS id,
features.properties.INSPIREID AS inspiredId,
features.properties.NATCODE AS natCode,
features.properties.NAMEUNIT AS nameUnit,
features.properties.CODNUT1 AS codNut1,
features.properties.CODNUT2 AS codNut2,
features.properties.CODNUT3 AS codNut3,
features.properties.CODIGOINE AS codigoIne,
features.properties.SHAPE_Length AS shapeLength,
features.properties.SHAPE_Area AS shapeArea,
features.geometry.type AS geometryType,
features.geometry.coordinates as geometry
FROM file('municipios_ign.geojson', 'JSON')
ARRAY JOIN features

Here, we get the following error:

Received exception from server (version 24.1.2): Code: 53. DB::Exception: Received from localhost:9000. DB::Exception: ARRAY JOIN requires array or map argument. (TYPE_MISMATCH)

This is caused by the parsing of features.geometry.coordinates.

  1. Let's check its data type.
SELECT DISTINCT toTypeName(features.geometry.coordinates) AS geometry
FROM file('municipios_ign.geojson', 'JSON')
ARRAY JOIN features

┌─geometry──────────────────────────────────────┐
│ Array(Array(Array(Array(Nullable(Float64)))))
└───────────────────────────────────────────────┘

It can be fixed by casting multipolygon.properties.coordinates to Array(Array(Array(Tuple(Float64,Float64)))). To do so, we can use the function arrayMap(func,arr1,...).

SELECT distinct
toTypeName(
arrayMap(features.geometry.coordinates->
arrayMap(features.geometry.coordinates->
arrayMap(features.geometry.coordinates-> (features.geometry.coordinates[1],features.geometry.coordinates[2])
,features.geometry.coordinates),
features.geometry.coordinates),
features.geometry.coordinates)
) as toTypeName
FROM file('municipios_ign.geojson', 'JSON')
ARRAY JOIN features;

┌─toTypeName───────────────────────────────────────────────────────┐
│ Array(Array(Array(Tuple(Nullable(Float64), Nullable(Float64)))))
└──────────────────────────────────────────────────────────────────┘
  1. Insert the data.
INSERT INTO geojson
SELECT
type as type,
name as name,
crs.type as crsType,
crs.properties.name as crsName,
features.type as featureType,
features.properties.FID id,
features.properties.INSPIREID inspiredId,
features.properties.NATCODE natCode,
features.properties.NAMEUNIT nameUnit,
features.properties.CODNUT1 codNut1,
features.properties.CODNUT2 codNut2,
features.properties.CODNUT3 codNut3,
features.properties.CODIGOINE codigoIne,
features.properties.SHAPE_Length shapeLength,
features.properties.SHAPE_Area shapeArea,
features.geometry.type geometryType,
arrayMap(features.geometry.coordinates->
arrayMap(features.geometry.coordinates->
arrayMap(features.geometry.coordinates-> (features.geometry.coordinates[1],features.geometry.coordinates[2]),features.geometry.coordinates)
,features.geometry.coordinates)
,features.geometry.coordinates) geometry
FROM file('municipios_ign.geojson', 'JSON')
ARRAY JOIN features;
SELECT count()
FROM geojson

┌─count()─┐
8205
└─────────┘

SELECT DISTINCT toTypeName(geometry)
FROM geojson

┌─toTypeName(geometry)─┐
MultiPolygon
└──────────────────────┘

Conclusion

Handling JSON can result in a complex task. This tutorial addressed a scenario where a nested object array could make this task even more difficult.
For any other JSON-related requirements, please refer to our documentation.

· 3 min read

Introduction

Filtering a ClickHouse table by an array-column is a common task and the product offers a lot of functions to work with array-columns.

In this article, we're going to focus on filtering a table by an array-column, but the video below covers a lot of other array-related functions:

Example

We'll use an example of a table with two columns tags_string and tags_int that contain an array of strings and integers respectively.

  • Create a sample database and table.
CREATE DATABASE db1;
  • Create a sample table
CREATE TABLE db1.tags_table
(
`id` UInt64,
`tags_string` Array(String),
`tags_int` Array(UInt64),
)
ENGINE = MergeTree
ORDER BY id;
  • Insert some sample data into the table.
INSERT INTO db1.tags_table VALUES (1, ['tag1', 'tag2', 'tag3'], [1, 2, 3]), (2, ['tag2', 'tag3', 'tag4'], [2, 3, 4]), (3, ['tag1', 'tag3', 'tag5'], [1, 3, 5]);

Filter the table using the has(arr, elem) function to return the rows where the arr array contains the elem element.

Filter the table to return the rows where the tags_string array contains the tag1 element.

SELECT * FROM db1.tags_table WHERE has(tags_string, 'tag1');
┌─id─┬─tags_string────────────┬─tags_int─┐
│ 1 │ ['tag1','tag2','tag3'] │ [1,2,3] │
│ 3 │ ['tag1','tag3','tag5'] │ [1,3,5] │
└────┴────────────────────────┴──────────┘

Use the hasAll(arr, elems) function to return the rows where all the elements in the elems array are present in the arr array.

Filter the table to return the rows where all the elements in the tags_string array are present in the ['tag1', 'tag2'] array.

SELECT * FROM db1.tags_table WHERE hasAll(tags_string, ['tag1', 'tag2']);
┌─id─┬─tags_string────────────┬─tags_int─┐
│ 1 │ ['tag1','tag2','tag3'] │ [1,2,3] │
└────┴────────────────────────┴──────────┘

Use the hasAny(arr, elems) function to return the rows where at least one element in the elems array is present in the arr array.

Filter the table to return the rows where at least one element in the tags_string array is present in the ['tag1', 'tag2'] array.

SELECT * FROM db1.tags_table WHERE hasAny(tags_string, ['tag1', 'tag2']);
┌─id─┬─tags_string────────────┬─tags_int─┐
│ 1 │ ['tag1','tag2','tag3'] │ [1,2,3] │
│ 2 │ ['tag2','tag3','tag4'] │ [2,3,4] │
│ 3 │ ['tag1','tag3','tag5'] │ [1,3,5] │
└────┴────────────────────────┴──────────┘

We can use a lambda function to filter the table using the arrayExists(lambda, arr) function.

Filter the table to return the rows where at least one element in the tags_int array is greater than 3.

SELECT * FROM db1.tags_table WHERE arrayExists(x -> x > 3, tags_int);
┌─id─┬─tags_string────────────┬─tags_int─┐
│ 2 │ ['tag2','tag3','tag4'] │ [2,3,4] │
│ 3 │ ['tag1','tag3','tag5'] │ [1,3,5] │
└────┴────────────────────────┴──────────┘

· 2 min read

Introduction

Sometimes you need to reingest all the data from one table to another.

For example, you might want to reingest data from a staging table to a production table. This article shows how to do this using the INSERT INTO statement.

Example

Below is a simple example on how it works and how to test:

  • Create a sample database
CREATE DATABASE db1;
  • Create a sample table
CREATE TABLE db1.source_table
(
city VARCHAR,
country VARCHAR,
continent VARCHAR
)
engine = MergeTree()
ORDER BY continent;
  • Insert some data into the source table
INSERT INTO db1.source_table (city, country, continent)
VALUES
('New York', 'USA', 'North America'),
('Tokyo', 'Japan', 'Asia'),
('Berlin', 'Germany', 'Europe'),
('Paris', 'France', 'Europe'),
('Cairo', 'Egypt', 'Africa'),
('Sydney', 'Australia', 'Australia');
  • Check the number of rows in the source table
SELECT COUNT(*) FROM db1.source_table;
┌─count()─┐
│ 6 │
└─────────┘
  • Create a new table with the same structure as the source table.
CREATE TABLE db1.target_table AS db1.source_table;
  • Insert all rows from the source table to the target table.
INSERT INTO db1.target_table SELECT * FROM db1.source_table;
  • Check the number of rows in the target table
SELECT COUNT(*) FROM db1.target_table;
┌─count()─┐
│ 6 │
└─────────┘

If you want to modify the structure of the new table, you can first display the structure of the source table.

SHOW CREATE TABLE db1.source_table;

Then create the new table with the modified structure. In our case we want to add a new column population to the target table.

CREATE TABLE db1.target_table_population
(
`city` String,
`country` String,
`continent` String,
`population` UInt16,
)
ENGINE = MergeTree
ORDER BY continent;
  • Insert all rows from the source table to the target table, including the new column. The population field is set to 0 for all rows.
INSERT INTO db1.target_table_population (city, country, continent, population)
SELECT city, country, continent, 0 FROM db1.source_table;
  • Check the data in the target table
SELECT * FROM db1.target_table_population LIMIT 3;
┌─city──────┬─country───┬─continent──────┬─population─┐
│ New York │ USA │ North America │ 0 │
│ Tokyo │ Japan │ Asia │ 0 │
│ Berlin │ Germany │ Europe │ 0 │
└───────────┴───────────┴────────────────┴────────────┘

· 6 min read

Users may see cases where their query is slower than expected, in the belief they are ordering or filtering by a primary key. In this article we show how users can confirm the key is used, highlighting common reasons its not.

Create table

Consider the following simple table:

CREATE TABLE logs
(
`code` LowCardinality(String),
`timestamp` DateTime64(3)
)
ENGINE = MergeTree
ORDER BY (code, toUnixTimestamp(timestamp))

Note how our ordering key includes toUnixTimestamp(timestamp) as the second entry.

Populate data

Populate this table with 100m rows:

INSERT INTO logs SELECT
['200', '404', '502', '403'][toInt32(randBinomial(4, 0.1)) + 1] AS code,
now() + toIntervalMinute(number) AS timestamp
FROM numbers(100000000)

0 rows in set. Elapsed: 15.845 sec. Processed 100.00 million rows, 800.00 MB (6.31 million rows/s., 50.49 MB/s.)

SELECT count()
FROM logs

┌───count()─┐
100000000-- 100.00 million
└───────────┘

1 row in set. Elapsed: 0.002 sec.

Basic filtering

If we filter by code we can see the number of rows scanned in the output. - 49.15 thousand. Notice how this is a subset of the total 100m rows.

SELECT count() AS c
FROM logs
WHERE code = '200'

┌────────c─┐
65607542-- 65.61 million
└──────────┘

1 row in set. Elapsed: 0.021 sec. Processed 49.15 thousand rows, 49.17 KB (2.34 million rows/s., 2.34 MB/s.)
Peak memory usage: 92.70 KiB.

Furthermore, we can confirm the use of the index with the EXPLAIN indexes=1 clause:

EXPLAIN indexes = 1
SELECT count() AS c
FROM logs
WHERE code = '200'

┌─explain────────────────────────────────────────────────────────────┐
│ Expression ((Project names + Projection))
│ AggregatingProjection │
│ Expression (Before GROUP BY)
│ Filter ((WHERE + Change column names to column identifiers))
│ ReadFromMergeTree (default.logs)
│ Indexes: │
│ PrimaryKey │
Keys: │
│ code │
│ Condition: (code in ['200', '200'])
│ Parts: 3/3
│ Granules: 8012/12209
│ ReadFromPreparedSource (_minmax_count_projection)
└────────────────────────────────────────────────────────────────────┘

Notice how the number of granules scanned 8012 is a fraction of the total 12209. The section higlighted below, confirms use of the primary key code.

PrimaryKey
Keys:
code

Granules are the unit of data processing in ClickHouse, with each typically holding 8192 rows. For further details on granules and how they are filtered we recommend reading this guide.

Note

Filtering on keys later in an ordering key will not be as efficient as filtering on those that are earlier in the tuple. For reasons why, see here

Multi-key filtering

Suppose we filter, by code and timestamp:

SELECT count()
FROM logs
WHERE (code = '200') AND (timestamp >= '2025-01-01 00:00:00') AND (timestamp <= '2026-01-01 00:00:00')

┌─count()─┐
689742
└─────────┘

1 row in set. Elapsed: 0.008 sec. Processed 712.70 thousand rows, 6.41 MB (88.92 million rows/s., 799.27 MB/s.)


EXPLAIN indexes = 1
SELECT count()
FROM logs
WHERE (code = '200') AND (timestamp >= '2025-01-01 00:00:00') AND (timestamp <= '2026-01-01 00:00:00')

┌─explain───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Expression ((Project names + Projection))
│ Aggregating │
│ Expression (Before GROUP BY)
│ Expression │
│ ReadFromMergeTree (default.logs)
│ Indexes: │
│ PrimaryKey │
Keys: │
│ code │
│ toUnixTimestamp(timestamp)
│ Condition: and((toUnixTimestamp(timestamp) in (-Inf, 1767225600]), and((toUnixTimestamp(timestamp) in [1735689600, +Inf)), (code in ['200', '200'])))
│ Parts: 3/3
│ Granules: 87/12209
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

13 rows in set. Elapsed: 0.002 sec.

In this case both ordering keys are used to filter rows, resulting in the need to only read 87 granules.

Using keys in sorting

ClickHouse can also exploit ordering keys for efficient sorting. Specifically,

When the optimize_read_in_order setting is enabled (by default), the ClickHouse server uses the table index and reads the data in order of the ORDER BY key. This allows us to avoid reading all data in case of specified LIMIT. So, queries on big data with small limits are processed faster. See here and here for further details.

This, however, requires alignment of the keys used.

For example, consider this query:

SELECT *
FROM logs
WHERE (code = '200') AND (timestamp >= '2025-01-01 00:00:00') AND (timestamp <= '2026-01-01 00:00:00')
ORDER BY timestamp ASC
LIMIT 10

┌─code─┬───────────────timestamp─┐
2002025-01-01 00:00:01.000
2002025-01-01 00:00:45.000
2002025-01-01 00:01:01.000
2002025-01-01 00:01:45.000
2002025-01-01 00:02:01.000
2002025-01-01 00:03:01.000
2002025-01-01 00:03:45.000
2002025-01-01 00:04:01.000
2002025-01-01 00:05:45.000
2002025-01-01 00:06:01.000
└──────┴─────────────────────────

10 rows in set. Elapsed: 0.009 sec. Processed 712.70 thousand rows, 6.41 MB (80.13 million rows/s., 720.27 MB/s.)
Peak memory usage: 125.50 KiB.

We can confirm that the optimization has not been exploited here by using EXPLAIN pipeline:

EXPLAIN PIPELINE
SELECT *
FROM logs
WHERE (code = '200') AND (timestamp >= '2025-01-01 00:00:00') AND (timestamp <= '2026-01-01 00:00:00')
ORDER BY timestamp ASC
LIMIT 10

┌─explain───────────────────────────────────────────────────────────────────────┐
(Expression)
│ ExpressionTransform │
(Limit)
Limit
(Sorting)
│ MergingSortedTransform 121
│ MergeSortingTransform × 12
│ LimitsCheckingTransform × 12
│ PartialSortingTransform × 12
(Expression)
│ ExpressionTransform × 12
(Expression)
│ ExpressionTransform × 12
(ReadFromMergeTree)
│ MergeTreeSelect(pool: ReadPool, algorithm: Thread) × 12 01
└───────────────────────────────────────────────────────────────────────────────┘

15 rows in set. Elapsed: 0.004 sec.

The line MergeTreeSelect(pool: ReadPool, algorithm: Thread) here does not indicate the use of the optimization but rather a standard read. This is caused by our table ordering key using toUnixTimestamp(Timestamp) NOT timestamp. Rectifying this mismatch addresses the issue:

EXPLAIN PIPELINE
SELECT *
FROM logs
WHERE (code = '200') AND (timestamp >= '2025-01-01 00:00:00') AND (timestamp <= '2026-01-01 00:00:00')
ORDER BY toUnixTimestamp(timestamp) ASC
LIMIT 10

┌─explain──────────────────────────────────────────────────────────────────────────┐
(Expression)
│ ExpressionTransform │
(Limit)
Limit
(Sorting)
│ MergingSortedTransform 31
│ BufferChunks × 3
(Expression)
│ ExpressionTransform × 3
(Expression)
│ ExpressionTransform × 3
(ReadFromMergeTree)
│ MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) × 3 01
└──────────────────────────────────────────────────────────────────────────────────┘

13 rows in set. Elapsed: 0.003 sec.

· 5 min read

The following steps can be used to enable SSL for a single ClickHouse Server using Let's Encrypt, a free, automated, and open Certificate Authority (CA) designed to make it easy for anyone to secure their websites with HTTPS. By automating the certificate issuance and renewal process, Let's Encrypt ensures websites remain secure without requiring manual intervention.

Note

We assume ClickHouse has been installed at the standard package locations in the following guide. We use the domain product-test-server.clickhouse-dev.com for all examples. Substitute your domain accordingly.

  1. Verify you have a DNS A or AAAA record pointing to your server. This can be achieved using the Linux tool dig. For example, the response for product-test-server.clickhouse-dev.com if using the Cloudflare DNS server 1.1.1.1:

dig @1.1.1.1 product-test-server.clickhouse-dev.com

; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> @1.1.1.1 product-test-server.clickhouse-dev.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22315
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;product-test-server.clickhouse-dev.com. IN A

;; ANSWER SECTION:
product-test-server.clickhouse-dev.com. 300 IN A 34.248.59.9

;; Query time: 52 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Thu Dec 12 09:37:33 UTC 2024
;; MSG SIZE rcvd: 83

Notice the section below confirming the presence of an A record.

;; ANSWER SECTION:
product-test-server.clickhouse-dev.com. 300 IN A 34.248.59.9
  1. Open port 80 on your server. This port will be used for automatic certificate renewal using the ACME protocol with certbot. For AWS, this can be achieved by modifying the instance's associated Security Group.

Open_Port_80_Security_Group

  1. Install certbot e.g. using apt

sudo apt install certbot
  1. Obtain an SSL certificate

sudo certbot certonly
Saving debug log to /var/log/letsencrypt/letsencrypt.log

How would you like to authenticate with the ACME CA?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1: Spin up a temporary webserver (standalone)
2: Place files in webroot directory (webroot)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Select the appropriate number [1-2] then [enter] (press 'c' to cancel): 1
Please enter the domain name(s) you would like on your certificate (comma and/or
space separated) (Enter 'c' to cancel): product-test-server.clickhouse-dev.com
Requesting a certificate for product-test-server.clickhouse-dev.com

Successfully received certificate.
Certificate is saved at: /etc/letsencrypt/live/product-test-server.clickhouse-dev.com/fullchain.pem
Key is saved at: /etc/letsencrypt/live/product-test-server.clickhouse-dev.com/privkey.pem
This certificate expires on 2025-03-12.
These files will be updated when the certificate renews.
Certbot has set up a scheduled task to automatically renew this certificate in the background.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
If you like Certbot, please consider supporting our work by:
* Donating to ISRG / Let's Encrypt: https://letsencrypt.org/donate
* Donating to EFF: https://eff.org/donate-le
Note

We don't have a web server running on our server, so use (1) allowing Certbot to use a standalone temporary web server.

Enter the full domain name of your server e.g. product-test-server.clickhouse-dev.com when requested.

Note

Let's Encrypt has a policy of not issuing certificates for certain types of domains, such as public cloud provider-generated domains (e.g., AWS *.compute.amazonaws.com domains). These domains are considered shared infrastructure and are blocked for security and abuse prevention reasons.

  1. Copy certificates to the ClickHouse directory.

echo '* * * * * root cp -u /etc/letsencrypt/live/product-test-server.clickhouse-dev.com/*.pem /etc/clickhouse-server/ && chown clickhouse:clickhouse /etc/clickhouse-server/*.pem && chmod 400 /etc/clickhouse-server/*.pem' | sudo tee /etc/cron.d/copy-certificates

This command sets up a cron job to automate the management of Let's Encrypt SSL certificates for a ClickHouse server. It runs every minute as the root user, copying the .pem files from the Let's Encrypt directory to the ClickHouse server's configuration directory, but only if the files have been updated. After copying, the script adjusts the ownership of the files to the clickhouse user and group, ensuring the server has the required access. It also sets secure read-only permissions (chmod 400) on the copied files to maintain strict file security. This ensures that the ClickHouse server always has access to the latest SSL certificates without requiring manual intervention, maintaining security and minimizing operational overhead.

  1. Configure the use of these certificates in clickhouse-server.

echo "https_port: 8443
tcp_port_secure: 9440
openSSL:
server:
certificateFile: '/etc/clickhouse-server/fullchain.pem'
privateKeyFile: '/etc/clickhouse-server/privkey.pem'
disableProtocols: 'sslv2,sslv3,tlsv1,tlsv1_1'" | sudo tee /etc/clickhouse-server/config.d/ssl.yaml
  1. Restart ClickHouse Server

sudo clickhouse restart
  1. Validate ClickHouse can communicate over SSL

curl https://product-test-server.clickhouse-dev.com:8443/

Ok.

For this last step to work you may need to ensure port 8443 is accessible e.g. included in your Security Group in AWS. Alternatively, if you only want to access ClickHouse from the server, modify your hosts file i.e.

echo "127.0.0.1 product-test-server.clickhouse-dev.com" | sudo tee -a /etc/hosts
Danger

If you open connections from wildcard addresses, make sure that at least one of the following measures is applied:

  • server is protected by firewall and not accessible from untrusted networks;
  • all users are restricted to a subset of network addresses (see users.xml);
  • all users have strong passwords, only secure (TLS) interfaces are accessible, or connections are only made via TLS interfaces.
  • users without passwords have read-only access.

See also: https://www.shodan.io/search?query=clickhouse

The blog Building single page applications with ClickHouse can be used as guidance for securing public instances.

The following should also work if connecting from the local machine on which ClickHouse is running. To connect via product-test-server.clickhouse-dev.com open port 9440 in your:

clickhouse client --secure --user default --password <password>

· 7 min read

ClickHouse has no pivot operator, but we can achieve similar behavior using aggregate function combinators and, in particular, the ones with the -Map suffix.

In this article, we're going to learn how to do that. There is also a video covering the same material, which you can see below:

Understanding aggregate function combinators

Let's start with a simple example. We're going to use clickhouse-local, which you can launch by running the following:

clickhouse -m --output_format_pretty_row_numbers=0

The following query calls the sumMap function, which takes in a map and sums the values of each key:

SELECT sumMap(map('ClickHouse', 1, 'ClickBench', 2));
┌─sumMap(map('ClickHouse', 1, 'ClickBench', 2))─┐
│ {'ClickBench':2,'ClickHouse':1} │
└───────────────────────────────────────────────┘

This isn't a particularly interesting example as it returns the same map that we passed in. Let's now call sumMap over multiple rows of maps;

WITH values AS (
SELECT map('ClickHouse', 3) AS value
UNION ALL
SELECT map('ClickBench', 2, 'ClickHouse', 4) AS value
)
SELECT sumMap(value)
FROM values;
┌─sumMap(value)───────────────────┐
│ {'ClickBench':2,'ClickHouse':7} │
└─────────────────────────────────┘

The key ClickHouse appeared on both rows and its values have been summed. The key ClickBench was only present on one line, so it sums a single value, which returns that value!

We can also use maxMap to find the maximum values per key:

WITH values AS (
SELECT map('ClickHouse', 3) AS value
UNION ALL
SELECT map('ClickBench', 2, 'ClickHouse', 4) AS value
)
SELECT maxMap(value)
FROM values;
┌─maxMap(value)───────────────────┐
│ {'ClickBench':2,'ClickHouse':4} │
└─────────────────────────────────┘

Or we can use avgMap to find the average value per key:

WITH values AS (
SELECT map('ClickHouse', 3) AS value
UNION ALL
SELECT map('ClickBench', 2, 'ClickHouse', 4) AS value
)
SELECT avgMap(value)
FROM values;
┌─avgMap(value)─────────────────────┐
│ {'ClickBench':2,'ClickHouse':3.5} │
└───────────────────────────────────┘

Hopefully that's given you an idea of how these function combinators work.

Real-World Application: UK housing prices dataset

Now we're going to use them on a bigger dataset in the ClickHouse SQL playground.

We can connect to the playground using clickhouse-client:

clickhouse client -m \
-h sql-clickhouse.clickhouse.com \
-u demo \
--secure

We're going to query the uk_price_paid table, so let's explore the data in that table:

SELECT * FROM uk.uk_price_paid LIMIT 1 FORMAT Vertical;
Row 1:
──────
price: 145000
date: 2008-11-19
postcode1:
postcode2:
type: semi-detached
is_new: 0
duration: leasehold
addr1:
addr2:
street: CURLEW DRIVE
locality: SCARBOROUGH
town: SCARBOROUGH
district: SCARBOROUGH
county: NORTH YORKSHIRE
category: 0

We can see above that the table contains various fields related to property sales in the UK.

Grouping and aggregating by decade

Let's work out the median prices grouped by county for each decade in the dataset:

WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
county,
medianMap(map(year, price)) AS medianPrices
FROM uk.uk_price_paid
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices───────────────────────────────────────┐
1. │ GREATER LONDON │ {1990:89972.5,2000:215000,2010:381500,2020:485000} │
2. │ TYNE AND WEAR │ {1990:46500,2000:93000,2010:130000,2020:139000} │
3. │ WEST MIDLANDS │ {1990:50000,2000:110000,2010:149950,2020:185000} │
4. │ GREATER MANCHESTER │ {1990:47000,2000:97000,2010:141171,2020:178000} │
5. │ MERSEYSIDE │ {1990:46750,2000:94972.5,2010:128000,2020:149000} │
6. │ HERTFORDSHIRE │ {1990:86500,2000:193000,2010:315000,2020:415000} │
7. │ WEST YORKSHIRE │ {1990:48995,2000:99950,2010:139000,2020:164950} │
8. │ BRIGHTON AND HOVE │ {1990:70000,2000:173000,2010:288000,2020:387000} │
9. │ DORSET │ {1990:76500,2000:182000,2010:250000,2020:315000} │
10. │ HAMPSHIRE │ {1990:79950,2000:177500,2010:260000,2020:335000} │
└────────────────────┴────────────────────────────────────────────────────┘

Filtering results

We can filter the results to only include data from 2010 and on:

WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
county,
medianMap(map(year, price)) AS medianPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices────────────────┐
1. │ GREATER LONDON │ {2010:384975,2020:485919.5} │
2. │ TYNE AND WEAR │ {2010:130000,2020:140000} │
3. │ WEST MIDLANDS │ {2010:146500,2020:185000} │
4. │ GREATER MANCHESTER │ {2010:140000,2020:177500} │
5. │ MERSEYSIDE │ {2010:130000,2020:150000} │
6. │ HERTFORDSHIRE │ {2010:315000,2020:415000} │
7. │ WEST YORKSHIRE │ {2010:140000,2020:162500} │
8. │ BRIGHTON AND HOVE │ {2010:287500,2020:387000} │
9. │ DORSET │ {2010:255750,2020:315000} │
10. │ HAMPSHIRE │ {2010:265000,2020:330000} │
└────────────────────┴─────────────────────────────┘

Combining multiple aggregations

And if we want to find the maximum price per decade we can do that using the maxMap function that we saw earlier:

WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
county,
medianMap(map(year, price)) AS medianPrices,
maxMap(map(year, price)) AS maxPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices──────────────┬─maxPrices───────────────────────┐
1. │ GREATER LONDON │ {2010:385000,2020:485250} │ {2010:594300000,2020:630000000} │
2. │ TYNE AND WEAR │ {2010:130000,2020:141000} │ {2010:448300979,2020:93395000} │
3. │ WEST MIDLANDS │ {2010:149000,2020:184250} │ {2010:415000000,2020:104500000} │
4. │ GREATER MANCHESTER │ {2010:140000,2020:175000} │ {2010:107086856,2020:319186000} │
5. │ MERSEYSIDE │ {2010:129950,2020:150000} │ {2010:300000000,2020:93395000} │
6. │ HERTFORDSHIRE │ {2010:315000,2020:415000} │ {2010:254325163,2020:93395000} │
7. │ WEST YORKSHIRE │ {2010:138500,2020:165000} │ {2010:246300000,2020:109686257} │
8. │ BRIGHTON AND HOVE │ {2010:285000,2020:387000} │ {2010:200000000,2020:71540000} │
9. │ DORSET │ {2010:250000,2020:315000} │ {2010:150000000,2020:20230000} │
10. │ HAMPSHIRE │ {2010:264000,2020:330000} │ {2010:150000000,2020:48482500} │
└────────────────────┴───────────────────────────┴─────────────────────────────────┘

Applying functions to map values

Alternatively, we can compute the average price using avgMap. Those values have a lot of decimal points, which we can clean up by using the mapApply function to call the floor function on each value in the map:

WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
county,
medianMap(map(year, price)) AS medianPrices,
mapApply((k, v) -> (k, floor(v)), avgMap(map(year, price))) AS avgPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices──────────────┬─avgPrices─────────────────┐
1. │ GREATER LONDON │ {2010:382000,2020:490000} │ {2010:626091,2020:807240} │
2. │ TYNE AND WEAR │ {2010:127000,2020:140000} │ {2010:176955,2020:225770} │
3. │ WEST MIDLANDS │ {2010:148500,2020:183000} │ {2010:204128,2020:257226} │
4. │ GREATER MANCHESTER │ {2010:140000,2020:177500} │ {2010:195592,2020:251165} │
5. │ MERSEYSIDE │ {2010:127995,2020:150000} │ {2010:182194,2020:206062} │
6. │ HERTFORDSHIRE │ {2010:317500,2020:415000} │ {2010:414134,2020:529409} │
7. │ WEST YORKSHIRE │ {2010:140000,2020:164500} │ {2010:185121,2020:234870} │
8. │ BRIGHTON AND HOVE │ {2010:285000,2020:387000} │ {2010:372285,2020:527184} │
9. │ DORSET │ {2010:250000,2020:315000} │ {2010:305581,2020:370739} │
10. │ HAMPSHIRE │ {2010:265000,2020:330000} │ {2010:335945,2020:425196} │
└────────────────────┴───────────────────────────┴───────────────────────────┘

Flexible grouping: counties, districts, and postcodes

Let's try to group by some different fields. This time we're going to compute the median price per decade grouped by county and district:

WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
county,
district,
medianMap(map(year, price)) AS medianPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10
    ┌─county─────────────┬─district───────────────┬─medianPrices────────────────┐
1. │ GREATER LONDON │ CROYDON │ {2010:298475,2020:400000} │
2. │ GREATER LONDON │ CITY OF WESTMINSTER │ {2010:800000,2020:935000} │
3. │ GREATER LONDON │ SOUTHWARK │ {2010:437000,2020:540000} │
4. │ TYNE AND WEAR │ NEWCASTLE UPON TYNE │ {2010:144000,2020:162500} │
5. │ WEST MIDLANDS │ WALSALL │ {2010:137450,2020:162000} │
6. │ GREATER LONDON │ CITY OF LONDON │ {2010:725875,2020:840000} │
7. │ GREATER LONDON │ HILLINGDON │ {2010:329125,2020:439000} │
8. │ GREATER MANCHESTER │ MANCHESTER │ {2010:144972.5,2020:190000} │
9. │ GREATER LONDON │ HAMMERSMITH AND FULHAM │ {2010:622250,2020:750000} │
10. │ GREATER LONDON │ ISLINGTON │ {2010:500000,2020:640000} │
└────────────────────┴────────────────────────┴─────────────────────────────┘

We could also choose to group by year and then concatenate postcode1 and postcode2 in the map:

WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
year,
medianMap(map(postcode1 || ' ' || postcode2, price)) AS medianPrices
FROM uk.uk_price_paid
WHERE postcode1 LIKE 'NP1'
GROUP BY ALL;
   ┌─year─┬─medianPrices────────────────────────────────────────────────────────┐
1. │ 1990 │ {'NP1 4PB':9000} │
2. │ 2000 │ {'NP1 4SR':28475,'NP1 7HZ':200000} │
3. │ 2010 │ {'NP1 4PB':5000,'NP1 4QJ':1075000,'NP1 4SR':58000,'NP1 8BR':200000} │
4. │ 2020 │ {'NP1 5DW':140000} │
└──────┴─────────────────────────────────────────────────────────────────────┘