As a streamer, why should I care about Druid?

I’m a sucker for technology. Love learning new things and following the trends. I started my working life with Oracle doing database things, moved to Hortonworks (now Cloudera) for plenty of fun with Big Data, and then on to Confluent where I was an event streamer.

Now here I am at Imply.

The growth of Apache Kafka and similar streaming technologies is undeniable. Streaming technologies let us react to each event as it occurs rather than waiting for the database to catch up. Databases remain fab at what they do. BUT. There are a few things they don’t do well. Or, better said: making them do those things well would cost a bucketload of effort and money.

This is a real-time world and we want to be working in the right now. I want to know where my pizza (or package, or taxi-cab, or transaction, or airplane, or WHATEVER) is right now. I don’t want a report of the locations it passed through once the database finally catches up.

That explains why I went to work with Apache Kafka. But why did I then move from Apache Kafka to Apache Druid at Imply? And why do I feel you should follow me too? (Metaphorically, not literally - we only have a small office!) That’s what I’ll try to explain in this short (and a little bit silly) blog post.

No thanks, we already got one.

You may have seen me at the Imply booth at AWS World. Perhaps you caught me doing a standup at Confluent Current, or heard me enthusiastically discussing real-time streaming analytics with any poor unfortunate who happened to hesitate near the Imply booth. (Maybe you were pausing to ogle the free socks. Another one takes the bait!)

The opening question I am always asked is “So just what is Apache Druid?”

To which I confidently reply “A super-fast database.”

This invariably gets the response “No thanks, we already got one,” which in turn brings to mind one of my favorite comedy scenes, from Monty Python and the Holy Grail: “He says they’ve already got one!!”

Of course you already got one! This is 2023. Who doesn't have half a dozen different flavors of database these days?! Undaunted, I push on, trying to convince my captive (figuratively, not literally) audience why they should care about my particular database. Stop, wait, don’t run away!

What have the Romans ever done for us?

As I try to explain why streamers should care about Druid, another of my absolute favorite Monty Python sketches comes to mind. Maybe you know this one: “So - what have the Romans ever done for us?”

The exchange goes a little like this…

Customer: So. What could a streaming-analytics database ever do for us?
Me: It’s very fast. I mean sub-second even at massive scale.

Customer: Pfft. We’ve got one of those.
Me: Sub-second queries handling TB or PB of data!

Customer: Yeah. Ours does that. It’s very nice.
Me: Handles high concurrency at the lowest possible cost!

Customer: Everyone says that.
Me: Automated fault tolerance! Continuous backups!

Customer: So apart from speed, scale, concurrency, cost efficiency, fault tolerance and automated backups, WHAT could a streaming-analytics database ever do for us??

Joking aside, it’s a fair question. What separates Apache Druid from the herd of other databases out there? Why should I spend my valuable time even looking at Apache Druid?

Glad you asked.

Why streamers need to care about streaming analytics

  • You want to put pictures on your events. We’re all about that. Draggy, droppy, datacubes, dashboards, visualizations, etc. It’s what we do. We got all that, and we got it in spades. Want a bar chart - we have it. Pie chart? Yep. Street map - those too.

  • You want your analytics to be real-time. You’ve done the hard work: you’ve made the leap from batch to event processing. Lord knows it's not been easy, but you’ve won the hearts and minds and converted the unbelievers.
    AND NOW you tell me you want to run analytics on those events by pushing them back into a database??? It makes no sense! No no no!
    You just moved away from that. If you’re processing events as they happen, you want analytics on those events as they happen.
    Well, you’re in luck - Druid is proper real-time. We call it “Query On Arrival”: as soon as an event arrives, it’s available to query. No waiting for a long-running query to be re-run, no waiting for your shards to be recomputed, no need to pre-cache queries or query results. (The second sketch after this list shows exactly that: querying the last minute of events.)

  • It couldn’t be easier to use. Native ingestion from your favorite streaming implementation (Apache Kafka, Confluent Platform, Amazon MSK, Amazon Kinesis, Azure Event Hubs, etc.) means that in the blink of an eye you can connect the hose from your streaming source, turn on the tap, and start pouring data into Druid. Whether you’re a command-line warrior or a GUI guru, give us your bootstrap server and your topic name and watch the events start flowing so you can start analyzing. (There’s a sketch of exactly this right after the list.)

  • Confluent uses it. Confluent uses Druid for their customers’ cluster metrics! Yep, the Kafka guys. What better endorsement than that?

  • Druid’s behavior will be reassuringly familiar to all my database and streaming homies. For example, ingestion from Kafka scales with partitions. It adheres to exactly-once semantics. Ingestion is tracked by recording per-partition offsets, and as such is resumable. In short, it behaves like a typical consumer - and it’s parallel all the way from producer to query. (The second sketch below shows those offsets in action.)
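
If you want to see just how little plumbing that takes, here’s a minimal sketch of starting Kafka ingestion: you POST a supervisor spec to Druid’s indexer API and Druid starts consuming. Everything specific in it - the `pizza-deliveries` datasource and topic, the broker address, the column names - is made up for illustration; swap in your own.

```python
import requests

# A minimal Kafka supervisor spec. The datasource name, topic, broker
# address, and dimension columns are placeholders for illustration.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "pizza-deliveries",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["order_id", "status", "lat", "lon"]},
            "granularitySpec": {"segmentGranularity": "hour", "rollup": False},
        },
        "ioConfig": {
            "type": "kafka",
            "topic": "pizza-deliveries",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# POST the spec to the Druid router (default port 8888), which forwards it
# to the Overlord. Ingestion starts as soon as the supervisor spins up.
resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
print(resp.json())  # e.g. {"id": "pizza-deliveries"}
```

That really is the whole handshake: a broker, a topic, and a schema. The GUI route builds the same spec with clicks.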

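And because the supervisor behaves like a regular Kafka consumer, you can watch its committed offsets and lag, and query events the instant they land. Another hedged sketch: it reuses the hypothetical `pizza-deliveries` datasource from above, and the exact fields in the status payload vary a little between Druid versions.

```python
import requests

DRUID = "http://localhost:8888"

# The supervisor records per-partition Kafka offsets, just like a consumer
# group, so its status endpoint shows where it is and how far behind.
status = requests.get(
    f"{DRUID}/druid/indexer/v1/supervisor/pizza-deliveries/status"
).json()
payload = status["payload"]
print(payload["state"])             # e.g. "RUNNING"
print(payload.get("aggregateLag"))  # records behind the head of the topic

# "Query On Arrival": events are queryable the moment they're ingested,
# so asking for the last minute of activity over the SQL API just works.
sql = """
SELECT "status", COUNT(*) AS deliveries
FROM "pizza-deliveries"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE
GROUP BY "status"
"""
print(requests.post(f"{DRUID}/druid/v2/sql", json={"query": sql}).json())
```
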
Interested?

Hopefully, right now you’re pounding the desk and yelling “Adam, I’m in! Tell me how to connect my data-hose!” Alrighty then - let me point you at a few quick starts that will have you data-cubing, dashboarding, and visualizing your streaming data in a flash…

You can thank me at the next Druid event. I’ll be on the stand trying to work out how many swallows it would take to carry a coconut.

Click the swallow below (European, not African) to get started.