SQL like query on CSV and other tabular data

CSV files are ubiquitous in the programming world and most data interchange is done using the CSV format. One problem however is that you cannot quickly check for certain rows depending on a particular field criteria. As an example, say you need to find all the rows from a CSV file that has a particular value in a field. The most probable path would be to import the CSV in a spreadsheet and then process it to find the particular information.

In this post we will look into a command-line tool that allows you to query CSV data with SQL like syntax. q – Text as Data is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files). You can look at q as a meta-tool which provides access to all the data-related tools that SQL provides (e.g. expressions, ordering, grouping, aggregation etc.), but on tabular data files.

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

Installation

Q is available for most popular platforms and you can download the version for your system from here. Windows users can get a nice installer, q.exe will be added to the PATH so you can access it everywhere.

Examples

We will take the following csv data as an example for testing, the first line is the header. Assume that the data is stored in a text file named ‘sales.csv’.

custno, firstname, lastname, purchase
23242, john, doe, 234.15
23253, jane, doe, 124.00
23253, jane, doe, 238.50
23221, greg, johnson, 345.00
23210, howard, gardner, 145.65
23253, jane, doe, 24.00
21253, james, garner, 224.00
13253, ron, howard, 224.00
26263, james, garner, 65.20
31226, helen, carter, 234.00

The simplest query we can do is to get the number of rows. You will need to run the command from the command line or shell, the main program file is ‘q’. The output is shown on the next line. Here the tables names are the CSV files (note that spaces in file names are not supported yet).

D:\csv-data> q -H -d, "SELECT COUNT(1) FROM data.csv"
10

This returns the total count of rows in the file (excluding headers). The -H option signals that the first row is a header. The -d specifies the delimiter, “,” here. The default delimiter is taken as a space, so be sure that you specify the delimiter correctly.

Take another practical example where the usefulness of the tool really shines.

D:\csv>q -H -d, "SELECT * FROM sales.csv WHERE lastname = 'doe'"
23242,john,doe,234.15
23253,jane,doe,124.0
23253,jane,doe,238.5
23253,jane,doe,24.0

The following will sum the purchase field for all the rows having lastname as ‘doe’.

D:\csv>q -H -d, "SELECT SUM(purchase) FROM sales.csv WHERE lastname = 'doe'"
620.65

Another example on a numeric field.

D:\csv>q -H -d, "SELECT * FROM sales.csv WHERE purchase > 200.00"
23242,john,doe,234.15
23253,jane,doe,238.5
23221,greg,johnson,345.0
21253,james,garner,224.0
13253,ron,howard,224.0
31226,helen,carter,234.0

q also supports GROUP BY as the following example illustrates.

D:\csv>q -H -d, "SELECT lastname, count(lastname) FROM sales.csv GROUP BY lastname"
carter,1
doe,4
gardner,1
garner,2
howard,1
johnson,1

Read input from standard input

You can also read from standard input so this essentially allows you to pipe data from another process to q. Use “-” if you want to read from stdin (e.g. q “SELECT * FROM -“)

D:\csvS>cat sales.csv | q -H -d, "SELECT * FROM - WHERE purchase > 200.00"
23242,john,doe,234.15
23253,jane,doe,238.5
23221,greg,johnson,345.0
21253,james,garner,224.0
13253,ron,howard,224.0
31226,helen,carter,234.0

q can also get some runtime flags you can try, the complete list is given here.

Some limitations

There are a couple of limitation in the current implementation.
1. It is possible that some rare cases of subqueries are not supported yet.
2. Spaces in file names are not supported yet.