How to Identify the Most Expensive Queries in ClickHouse
Answer
The query_log table in the system database keeps track of all your queries, including:
- how much memory the query consumed, and
- how much CPU time was needed
The following query returns the top 10 queries, where "top" means the queries that used the most memory:
The response looks like:
The initial_query_id represents the ID of the initial query for distributed query execution launched from the node receiving the request. The query_id contains the ID of the child query executed on a different node. See this article for more details.
You can use the query ID to extract more details about the query. Let's research our longest running query above (the first one):
It turns out to be the query we used to insert a few billion rows of data into a table named youtube (see the YouTube dislikes dataset):
initial_query_id VS query_id
Note that in a clustered ClickHouse environment (like ClickHouse Cloud) initial_query_id represents the ID of the initial query for distributed query execution launched from the node receiving the request;
then query_id field will contain the ID of the child query executed on a different node.
If we add query_id to the above query we pin our search around initial_query_id = a7262fa2-bd8b-4b51-a359-621ccf282417 and hostname():
we will get:
Note we have several results from several hosts (the different cluster nodes).
To refine further and get only the child queries we could also add the query_id != initial_query_id condition to the WHERE clause:
returns all the child queries executed on the remote nodes (remote to the node where the query was first thrown at):
conversely, query_id = initial_query_id will return only the queries executed on the local node where the distributed query was first thrown at:
giving:
As for other System Tables, you can find more details about the meaning of each field in our docs.
