osquery: UNIX as a SQL database

Earlier this year, in “UNIX as a SQL database”, I wrote:

UNIX is full of tables. When we talk about “processes”, we’re really referring to “rows in a process table.” When we talk about “file descriptors”, we’re really referring to “rows in a per-process file descriptor table”. There are other tables, too: a global file table, an inode table, routing tables, a mount table, page tables, and other tables I don’t know about.

These “tables” are custom in-memory data structures, but can be understood relationally. Here’s a simplified description of them in SQL.
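That SQL description is not reproduced here. As a rough illustration of the idea only, a minimal sketch using Python's built-in sqlite3 module might look like the following. Every table and column name in it (process, open_file, file_descriptor, file_offset and so on) is a hypothetical choice made for this sketch, not the schema from the earlier post and not a real kernel layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified tables; names are illustrative only.
    CREATE TABLE process (
        pid  INTEGER PRIMARY KEY,
        ppid INTEGER REFERENCES process(pid)
    );
    CREATE TABLE open_file (
        id          INTEGER PRIMARY KEY,
        file_offset INTEGER NOT NULL
    );
    CREATE TABLE file_descriptor (
        pid          INTEGER REFERENCES process(pid),
        fd           INTEGER,
        open_file_id INTEGER REFERENCES open_file(id),
        PRIMARY KEY (pid, fd)
    );
    -- Made-up rows: process 904 has fds 0 and 1 sharing one open file.
    INSERT INTO process VALUES (1, NULL), (904, 1);
    INSERT INTO open_file VALUES (10, 0);
    INSERT INTO file_descriptor VALUES (904, 0, 10), (904, 1, 10);
""")

# "The file descriptors of process 904" becomes a plain SELECT.
fds = [
    fd
    for (fd,) in conn.execute(
        "SELECT fd FROM file_descriptor WHERE pid = ? ORDER BY fd", (904,)
    )
]
print(fds)
```

The per-process file descriptor tables are flattened into one table keyed by (pid, fd), which is the usual relational way to express "one table per parent row".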

When I wrote that, it was only as a conceptual framework and a pipe dream. But I just discovered that my pipe dream is not just a dream: it exists, it’s active, it has major backing, and over 10K stars on GitHub! 😱 This project is osquery. osquery lets you query a UNIX system using SQL.

osquery effectively replaces hundreds of crusty, confusing UNIX tools which would otherwise take decades to learn. I often want to find the ID of the process listening on a particular TCP port. Through 20 minutes of horrible UNIX incantations, I came up with:

$ lsof -P -n  -Fp -s TCP:LISTEN -i :15000 | grep '^p' | cut -dp -f 2
904

I challenge you to understand the above. I don’t even know if it’s correct. By contrast, with osquery, I was able to write:

$ osqueryi --header=false --list \
  "select pid from process_open_sockets
   where remote_port=0 and local_port=15000"
904

I’m sure you’ve heard the quote: ‘Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.’ This quote can only have come from the UNIX world. UNIX is obsessed with plaintext formats, and takes pride in grep, sed, cut and all its other text-munging tools. But the fact is that these are buggy and hard to learn. lsof admits this and makes some concessions with its -F flag, which makes its output “suitable for post-processing”. osquery’s most obvious advantage is that it does away with all of this, and lets you query the domain objects instead of some ad-hoc plaintext.
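To make the -F format concrete, here is a minimal sketch in Python of what the grep and cut stages of the earlier pipeline are doing. It assumes that each line of lsof's field output starts with a one-character tag (p for process ID, f for file descriptor), as the lsof man page describes. The sample string is made up for illustration rather than captured from a real run, and pids_from_lsof_fields is a hypothetical helper name.

```python
# Illustrative sample of `lsof -Fp` style output, not captured from a real run.
# Each line is a one-character field tag followed by the value:
# "p" = process ID, "f" = file descriptor.
SAMPLE_LSOF_OUTPUT = "p904\nf3\n"


def pids_from_lsof_fields(text):
    """Return the PIDs found on lines tagged "p", in order of appearance."""
    pids = []
    for line in text.splitlines():
        # Same job as `grep '^p' | cut -dp -f 2`: keep "p" lines, drop the tag.
        if line.startswith("p"):
            pids.append(int(line[1:]))
    return pids


print(pids_from_lsof_fields(SAMPLE_LSOF_OUTPUT))
```

Even in this "machine-readable" mode, the consumer still has to know the tag alphabet and filter out the lines it did not ask about.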

What program is process 904? What are its arguments? With standard UNIX tools, you’ll have to glue together lsof and ps with some shell magic:

$ ps -o args= -p \
  $(lsof -P -n  -Fp -s TCP:LISTEN -i :15000 | grep '^p' | cut -dp -f 2)
nc -l 15000

After another 5 minutes of googling: ah, it’s nc, instructed to listen on port 15000. UNIX talks of each tool “doing one thing”, but it’s a pain to glue tools together like this. Nearly all UNIX tools admit this with some ad hoc concessions. For example, if you know the incantation, lsof can print you the name of the program that’s listening on that port. But if you want the command-line arguments, you’re out of luck and you’ll need to glue together multiple tools.

If instead I were to use osquery, I could use JOIN over the domain objects instead of joining ad hoc plaintext:

$ osqueryi --header=false --list \
  "select p.cmdline from process_open_sockets s
  join processes p on s.pid = p.pid
  where local_port=15000 and remote_port=0"
nc -l 15000
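If you don't have osquery to hand, the shape of that JOIN can be tried out with a small sketch in Python's built-in sqlite3 module. The two in-memory tables below are toy stand-ins that borrow the names of the osquery tables and only the columns used in the query above. The rows are invented (apart from the pid 904 and nc -l 15000 pair from this post), and listener_cmdlines is a hypothetical helper name.

```python
import sqlite3

# Toy stand-ins for two osquery tables; only the columns used in the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processes (pid INTEGER PRIMARY KEY, cmdline TEXT);
    CREATE TABLE process_open_sockets (
        pid INTEGER, local_port INTEGER, remote_port INTEGER
    );
    INSERT INTO processes VALUES
        (904, 'nc -l 15000'),
        (1200, 'curl example.com');
    INSERT INTO process_open_sockets VALUES
        (904, 15000, 0),     -- listening socket: no remote peer yet
        (1200, 52344, 80);   -- connected socket
""")


def listener_cmdlines(conn, port):
    """Command lines of processes with a listening socket on `port`."""
    rows = conn.execute(
        """
        SELECT p.cmdline
        FROM process_open_sockets s
        JOIN processes p ON s.pid = p.pid
        WHERE s.local_port = ? AND s.remote_port = 0
        """,
        (port,),
    ).fetchall()
    return [cmdline for (cmdline,) in rows]


print(listener_cmdlines(conn, 15000))
```

The port is passed as a bound parameter rather than pasted into the SQL string, which is the one habit worth carrying over to any real query tool.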

But there’s another, more subtle advantage of osquery. osquery clearly exposes the semantics of UNIX in a way that ad-hoc tooling never can. Instead of reading ambiguous English man pages, you can read the database schema:

$ osqueryi
Using a virtual database. Need help, type '.help'
osquery> .schema process_open_sockets
CREATE TABLE process_open_sockets(`pid` INTEGER, `fd` BIGINT, `socket` BIGINT,
  `family` INTEGER, `protocol` INTEGER, `local_address` TEXT,
  `remote_address` TEXT, `local_port` INTEGER, `remote_port` INTEGER,
  `path` TEXT, PRIMARY KEY (`pid`)) WITHOUT ROWID;

There’s one big downside: osquery is probably not installed on the machines you access. I prefer not to use non-standard tools when I can afford to use the defaults. But I don’t feel like I can afford the defaults: consulting the man pages for lsof and ps every time I want to investigate a process costs too much time.

Tagged #programming, #unix.


This page copyright James Fisher 2017. Content is not associated with my employer.
