Advanced Tutorial
Overview
Learn how to ingest and query data in ClickHouse using a New York City taxi example dataset.
Prerequisites
You need access to a running ClickHouse service to complete this tutorial. For instructions, see the Quick Start guide.
Create a new table
The New York City taxi dataset contains details about millions of taxi rides, with columns including tip amount, tolls, payment type, and more. Create a table to store this data.
- Connect to the SQL console:
- For ClickHouse Cloud, select a service from the dropdown menu and then select SQL Console from the left navigation menu.
- For self-managed ClickHouse, connect to the SQL console at https://_hostname_:8443/play. Check with your ClickHouse administrator for the details.
- Create the following tripstable in thedefaultdatabase:
Add the dataset
Now that you've created a table, add the New York City taxi data from CSV files in S3.
- 
The following command inserts ~2,000,000 rows into your tripstable from two different files in S3:trips_1.tsv.gzandtrips_2.tsv.gz:
- 
Wait for the INSERTto finish. It might take a moment for the 150 MB of data to be downloaded.
- 
When the insert is finished, verify it worked: This query should return 1,999,657 rows. 
Analyze the data
Run some queries to analyze the data. Explore the following examples or try your own SQL query.
- 
Calculate the average tip amount: Expected output
- 
Calculate the average cost based on the number of passengers: Expected outputThe passenger_countranges from 0 to 9:
- 
Calculate the daily number of pickups per neighborhood: Expected output
- 
Calculate the length of each trip in minutes, then group the results by trip length: Expected output
- 
Show the number of pickups in each neighborhood broken down by hour of the day: Expected output
- 
Retrieve rides to LaGuardia or JFK airports: Expected output
Create a dictionary
A dictionary is a mapping of key-value pairs stored in memory. For details, see Dictionaries
Create a dictionary associated with a table in your ClickHouse service. The table and dictionary are based on a CSV file that contains a row for each neighborhood in New York City.
The neighborhoods are mapped to the names of the five New York City boroughs (Bronx, Brooklyn, Manhattan, Queens and Staten Island), as well as Newark Airport (EWR).
Here's an excerpt from the CSV file you're using in table format. The LocationID column in the file maps to the pickup_nyct2010_gid and dropoff_nyct2010_gid columns in your trips table:
| LocationID | Borough | Zone | service_zone | 
|---|---|---|---|
| 1 | EWR | Newark Airport | EWR | 
| 2 | Queens | Jamaica Bay | Boro Zone | 
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone | 
| 4 | Manhattan | Alphabet City | Yellow Zone | 
| 5 | Staten Island | Arden Heights | Boro Zone | 
- Run the following SQL command, which creates a dictionary named taxi_zone_dictionaryand populates the dictionary from the CSV file in S3. The URL for the file ishttps://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv.
Setting LIFETIME to 0 disables automatic updates to avoid unnecessary traffic to our S3 bucket. In other cases, you might configure it differently. For details, see Refreshing dictionary data using LIFETIME.
- 
Verify it worked. The following should return 265 rows, or one row for each neighborhood: 
- 
Use the dictGetfunction (or its variations) to retrieve a value from a dictionary. You pass in the name of the dictionary, the value you want, and the key (which in our example is theLocationIDcolumn oftaxi_zone_dictionary).For example, the following query returns the BoroughwhoseLocationIDis 132, which corresponds to JFK airport):JFK is in Queens. Notice the time to retrieve the value is essentially 0: 
- 
Use the dictHasfunction to see if a key is present in the dictionary. For example, the following query returns1(which is "true" in ClickHouse):
- 
The following query returns 0 because 4567 is not a value of LocationIDin the dictionary:
- 
Use the dictGetfunction to retrieve a borough's name in a query. For example:This query sums up the number of taxi rides per borough that end at either the LaGuardia or JFK airport. The result looks like the following, and notice there are quite a few trips where the pickup neighborhood is unknown: 
Perform a join
Write some queries that join the taxi_zone_dictionary with your trips table.
- 
Start with a simple JOINthat acts similarly to the previous airport query above:The response looks is identical to the dictGetquery:noteNotice the output of the above JOINquery is the same as the query before it that useddictGetOrDefault(except that theUnknownvalues are not included). Behind the scenes, ClickHouse is actually calling thedictGetfunction for thetaxi_zone_dictionarydictionary, but theJOINsyntax is more familiar for SQL developers.
- 
This query returns rows for the the 1000 trips with the highest tip amount, then performs an inner join of each row with the dictionary: noteGenerally, we avoid using SELECT *often in ClickHouse. You should only retrieve the columns you actually need. However, this query is slower for the purposes of the example.
Next steps
Learn more about ClickHouse with the following documentation:
- Introduction to Primary Indexes in ClickHouse: Learn how ClickHouse uses sparse primary indexes to efficiently locate relevant data during queries.
- Integrate an external data source: Review data source integration options, including files, Kafka, PostgreSQL, data pipelines, and many others.
- Visualize data in ClickHouse: Connect your favorite UI/BI tool to ClickHouse.
- SQL Reference: Browse the SQL functions available in ClickHouse for transforming, processing and analyzing data.
