No. I think the ideal approach here is to use JSON over Kafka and store the data in Avro files; the Avro files have the schema at the start.
I wonder what they're using to retrieve that data for analysis later on. I do something very similar to this, but having to sift through millions of messages for a given time period to find a subset of them is kinda annoying.
It's a good thing they didn't use Confluent Camus. -shudder- It supports Avro-over-Kafka out of the box, with the caveat that every single time it reads a message off Kafka, it pings the schema registry to get the schema for it. That's great and all, until you've got thousands of messages per second.
> I wonder what they're using to retrieve that data for analysis later on. I do something very similar to this, but having to sift through millions of messages for a given time period to find a subset of them is kinda annoying.
It looks like Hive or Spark, depending on the use case. The data is also loaded into Druid for looking at statistics, rather than pulling full data about individual messages.
> It's a good thing they didn't use Confluent Camus. -shudder- It supports Avro-over-Kafka out of the box, with the caveat that every single time it reads a message off Kafka, it pings the schema registry to get the schema for it. That's great and all, until you've got thousands of messages per second.
They are using Camus; much of the post is dedicated to it. It looks like they are also running Avro over Kafka+Camus for some application logging, but at a lower volume (~10k messages/sec peak).
IIRC, Confluent has their own version of Camus that uses a schema registry, where the first few bytes of each Kafka message identify the schema.
The Wikimedia article sounds like they're just using regular Camus and their own interpreter. That would perform a bit better :) Still, I wonder why they didn't just write a Spark job to do the same thing.
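For reference, that Confluent framing is simple to pull apart: a magic byte of 0, then a 4-byte big-endian schema ID from the registry, then the Avro-encoded payload. A minimal sketch (the function name is my own):

```python
import struct

def parse_confluent_header(message: bytes):
    """Split a Confluent-framed Kafka message into (schema_id, payload).

    Wire format: 1 magic byte (0x00), a 4-byte big-endian schema ID,
    then the Avro-encoded payload.
    """
    if len(message) < 5 or message[0] != 0:
        raise ValueError("not a Confluent-framed message")
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]
```

Since the schema ID is right there in every message, a consumer only needs the registry to resolve IDs it hasn't seen yet.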
> It's a good thing they didn't use Confluent Camus. -shudder- It supports Avro-over-Kafka out of the box, with the caveat that every single time it reads a message off Kafka, it pings the schema registry to get the schema for it. That's great and all, until you've got thousands of messages per second.
Why would they need to hit the registry for every message? Wouldn't the schemas be immutable and thus able to be (at least temporarily) cached? They might have millions of messages but it's doubtful they have millions of message schemas.
The schemas are not immutable. You don't hit the schema registry for every message either; in fact, you can skip the registry altogether and provide the schema manually if you'd like.
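The caching asked about above is basically a one-line memo table. A hypothetical sketch, where `fetch_schema` stands in for whatever registry client is actually in use:

```python
# Cache registry responses keyed by schema ID, so the registry is hit
# once per distinct schema rather than once per message.
_schema_cache: dict = {}

def get_schema(schema_id, fetch_schema):
    """Return the schema for schema_id, consulting the registry only
    on a cache miss. `fetch_schema` is a hypothetical callable that
    does the actual network lookup."""
    if schema_id not in _schema_cache:
        _schema_cache[schema_id] = fetch_schema(schema_id)
    return _schema_cache[schema_id]
```

With millions of messages but only a handful of schemas, nearly every lookup is a cache hit.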
You could provide them manually, but then any schema upgrade becomes a big pain. Wikimedia, as one example, uses versioned schemas; each version is immutable and can be pulled from a cache. Each Kafka message is prefixed with a null byte followed by a long version number, indicating which schema version should decode it.
Well, Avro files embed the schema with the data, and deserialization will default to using that schema.
I think JSON is great for a one-off, but I'd hate to be on a team that doesn't schema their data exchange formats.