You have Kafka as your message broker up and running, and you may wonder: in which format should I send my data around? Maybe the string format pops into your mind. Why not just put all fields into a long string and separate them with commas?

What a great option, no need to worry about the format of the data! And yes, you could do that. But it will get really messy. Trust me. Let’s assume you wanted to send some data:

"Max, Mustermann, 42, null"

What would that mean? What is 42? What information is missing? And what would happen if you decided to add more fields or remove one?

If you store your data in a relational database such as MySQL or Postgres, you create a table definition with named columns and specify data types. You should do the same for streaming platforms. Avro lets you define a schema for your data and serializes the data using this schema into a compact binary format, which can be deserialized by any other application.

 "Max, Mustermann, 42, null" could be associated with an avro schema like this.

{
  "type": "record",
  "name": "person",
  "fields": [
      {"name": "first_name", "type": "string" }
    , {"name": "last_name", "type": "string" }
    , {"name": "age", "type": "float" }
    , {"name": "nationality", "type": ["null", "string"], "default": null}
  ]
}
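
To get a feel for what this looks like in code, here is a minimal sketch using the third-party fastavro Python library (the library choice, the variable names and the 42.0 value are my assumptions, not part of the article): it serializes one record with the schema above into bytes and reads it back.

# pip install fastavro
import io
import fastavro

person_schema = fastavro.parse_schema({
    "type": "record",
    "name": "person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "age", "type": "float"},
        {"name": "nationality", "type": ["null", "string"], "default": None},
    ],
})

record = {"first_name": "Max", "last_name": "Mustermann", "age": 42.0, "nationality": None}

# Serialize the record into a compact binary blob (no schema header attached).
buffer = io.BytesIO()
fastavro.schemaless_writer(buffer, person_schema, record)
payload = buffer.getvalue()
print(len(payload), "bytes")  # far smaller than the equivalent comma-separated string

# Deserialize: any application that knows the schema can turn the bytes back into a record.
decoded = fastavro.schemaless_reader(io.BytesIO(payload), person_schema)
print(decoded)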

So an Avro schema is just a data structure declared in JSON and attached to your data. As with relational databases, you have to specify a name and a type for every field (and optionally other attributes such as doc and a default value). Note that nationality is declared as a union of "null" and "string", so the field may be empty, and its default is null.

  • name: the name of your field
  • doc: documentation for your field (optional)
  • type: the data type of your field. It can be a primitive type (e.g. "boolean", "int", "string"), a complex type (record, enum, array, map, union, fixed), or a reference to a named type
  • default: a default value for your field (optional)

The schema itself is written in JSON, but you don’t want to ship the full schema at the beginning of every message. This is where the schema registry comes in: an app that stores your schemas and handles their distribution to producers and consumers. Each message then only carries a small ID that references the registered schema. The schema registry is hosted outside of your Kafka cluster (e.g. the Confluent Schema Registry). It also checks whether the data you send and receive complies with the registered schema. Nice!
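
Hooked up to Kafka, this could look roughly like the following sketch with the confluent-kafka Python client. The registry URL, broker address and topic name are placeholders I made up; replace them with your own setup.

# pip install "confluent-kafka[avro]"
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

person_schema_str = """
{
  "type": "record",
  "name": "person",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "age", "type": "float"},
    {"name": "nationality", "type": ["null", "string"], "default": null}
  ]
}
"""

# Assumed endpoints and topic name; adjust to your environment.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry, person_schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

record = {"first_name": "Max", "last_name": "Mustermann", "age": 42.0, "nationality": None}

# The serializer registers the schema (if it is new), prepends the schema ID,
# and encodes the record in Avro's binary format.
producer.produce(
    topic="persons",
    value=avro_serializer(record, SerializationContext("persons", MessageField.VALUE)),
)
producer.flush()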

Every schema has a unique ID and captures “only” a certain point in time. We all know that data and schemas evolve over time: fields are added, fields are removed. Let’s assume you wanted to add a new field “gender”. For relational databases this is an issue, because they require the same schema for every row.

Avro schemas can be made backward and forward compatible. You need to specify a default value for the new field; since the field may be absent in old data, it is declared as a nullable union (["null", "string"]) with null as its default. If data written with the old schema is read with the new schema, the default value is automatically assigned to the missing field.

{
  "type": "record",
  "name": "person",
  "fields": [
      {"name": "first_name", "type": "string" }
    , {"name": "last_name", "type": "string" }
    , {"name": "age", "type": "float" }
    , {"name": "nationality", "type": ["null", "string"], "default": null}
    , {"name": "gender", "type": ["null", "string"], "default": null}
  ]
}

The new schema is also forward compatible with the old one. When data written with the new schema is projected onto the old schema, the new field is simply dropped. Had the new schema instead dropped the field nationality, it would not be forward compatible with the original person schema, since we wouldn’t know how to fill in the value for nationality. So you can read old data with the new schema and new data with the old schema. Very convenient!
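
Here is a small sketch (again with fastavro, purely as an illustration) that exercises both directions: data written with the old schema and read with the new one, and data written with the new schema and read with the old one.

import io
import fastavro

old_schema = fastavro.parse_schema({
    "type": "record", "name": "person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "age", "type": "float"},
        {"name": "nationality", "type": ["null", "string"], "default": None},
    ],
})

new_schema = fastavro.parse_schema({
    "type": "record", "name": "person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "age", "type": "float"},
        {"name": "nationality", "type": ["null", "string"], "default": None},
        {"name": "gender", "type": ["null", "string"], "default": None},
    ],
})

max_old = {"first_name": "Max", "last_name": "Mustermann", "age": 42.0, "nationality": None}

# Backward compatible: old data, read with the new schema.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, old_schema, max_old)
buf.seek(0)
print(fastavro.schemaless_reader(buf, old_schema, new_schema))
# -> gender is filled in with its default value (None)

# Forward compatible: new data, read with the old schema.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, new_schema, {**max_old, "gender": "male"})
buf.seek(0)
print(fastavro.schemaless_reader(buf, new_schema, old_schema))
# -> the unknown field gender is simply dropped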

Summary

Avro is a row-based binary storage format that supports serialization. It uses JSON to describe the data and a compact binary encoding to optimize storage size. Schema changes can be made forward and backward compatible.