Apache Avro

Developer(s): Apache Software Foundation
Initial release: 2 November 2009
Stable release: 1.11.3 / September 23, 2023
Repository: Avro Repository
Written in: Java, C, C++, C#, Perl, Python, PHP, Ruby
Type: Remote procedure call framework
License: Apache License 2.0
Website: avro.apache.org

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two schema languages: one intended for human editing (Avro IDL) and another, more machine-readable one based on JSON.

It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).

Apache Spark SQL can access Avro as a data source.

Avro Object Container File

An Avro Object Container File consists of a file header, followed by one or more file data blocks.

The file header consists of:

  • Four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (Binary values 0x4F 0x62 0x6A 0x01).
  • File metadata, including the schema definition.
  • The 16-byte, randomly-generated sync marker for this file.
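
The header layout above can be illustrated with a minimal, standard-library-only sketch. It assumes the file metadata is encoded as an Avro map of strings to bytes (blocks of key/value pairs terminated by a zero count) and that lengths use zigzag variable-length integers. The helper names `read_long`, `read_bytes` and `read_header` are hypothetical and are not part of the Avro library.

```python
import io

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', then version byte 1


def read_long(stream):
    """Decode one zigzag variable-length integer from the stream."""
    accum, shift = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        accum |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header(stream):
    """Return (metadata dict, sync marker) from an object container header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, then |count| pairs
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta, stream.read(16)


# Hand-built header with a placeholder schema ("{}") and a dummy sync marker.
header = (MAGIC + b"\x04" + b"\x14avro.codec" + b"\x08null"
          + b"\x16avro.schema" + b"\x04{}" + b"\x00" + bytes(range(16)))
meta, sync = read_header(io.BytesIO(header))
```

In a real file, the value stored under `avro.schema` is the full JSON schema text rather than the placeholder used here.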

For data blocks Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
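
To give a feel for why the binary encoding is compact, the following sketch encodes integers, strings and one record by hand. It assumes zigzag variable-length integers for int and long values, length-prefixed UTF-8 for strings, and a union written as a branch index followed by the value. The function names are hypothetical, and the union branch order (value first, null second) is an assumption chosen to match the schema embedded in the hex dump of users.avro shown later in this article.

```python
def encode_long(value):
    """Encode an integer as a zigzag variable-length integer."""
    zigzag = (value << 1) ^ (value >> 63)  # small magnitudes become small codes
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_string(text):
    """Encode a string as its UTF-8 byte length followed by the bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(name, favorite_number=None, favorite_color=None):
    """Encode one User record; unions assumed ordered [value, null]."""
    out = encode_string(name)
    for value, encode in ((favorite_number, encode_long),
                          (favorite_color, encode_string)):
        if value is None:
            out += encode_long(1)  # branch 1 is null and carries no payload
        else:
            out += encode_long(0) + encode(value)  # branch 0, then the value
    return out


alyssa = encode_user("Alyssa", favorite_number=256)
ben = encode_user("Ben", favorite_number=8, favorite_color="red")
print(alyssa.hex(" "))
print(ben.hex(" "))
```

Each record takes 11 bytes here, and no field names are written to the data block; they live only in the schema.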

Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

Simple schema example:

{
"namespace":"example.avro",
"type":"record",
"name":"User",
"fields":[
{"name":"name","type":"string"},
{"name":"favorite_number","type":["null","int"]},
{"name":"favorite_color","type":["null","string"]}
]
}
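
Because a schema is plain JSON, it can be inspected with ordinary JSON tooling. The sketch below uses only Python's standard `json` module (not the Avro library) to list the record's fields and to flag those whose union type includes "null", which is how optional fields are expressed in the example above. The helper `describe_fields` is a hypothetical name introduced for illustration.

```python
import json

SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"]},
    {"name": "favorite_color", "type": ["null", "string"]}
  ]
}
"""


def describe_fields(schema):
    """Return (field name, non-null types, nullable?) for each record field."""
    described = []
    for field in schema["fields"]:
        field_type = field["type"]
        # A JSON array denotes a union; anything else is a single type.
        branches = field_type if isinstance(field_type, list) else [field_type]
        non_null = [branch for branch in branches if branch != "null"]
        described.append((field["name"], non_null, "null" in branches))
    return described


schema = json.loads(SCHEMA_JSON)
full_name = f'{schema["namespace"]}.{schema["name"]}'
fields = describe_fields(schema)
for name, types, nullable in fields:
    print(name, types, "nullable" if nullable else "required")
```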

Serializing and deserializing

Data in Avro can be stored together with its corresponding schema, so a serialized item can be read without knowing the schema ahead of time.

Example serialization and deserialization code in Python

Serialization:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# A schema is required to write data (this example follows the Apache Avro 1.8.2 API)
with open("user.avsc", "rb") as schema_file:
    schema = avro.schema.parse(schema_file.read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
writer.close()

File "users.avro" will contain the schema in JSON and a compact binary representation of the data:

$ od -v -t x1z users.avro
0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63  >Obj...avro.codec<
0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d  >.null.avro.schem<
0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63  >a..{"type": "rec<
0000060 6f 72 64 22 2c 20 22 6e 61 6d 65 22 3a 20 22 55  >ord", "name": "U<
0000100 73 65 72 22 2c 20 22 6e 61 6d 65 73 70 61 63 65  >ser", "namespace<
0000120 22 3a 20 22 65 78 61 6d 70 6c 65 2e 61 76 72 6f  >": "example.avro<
0000140 22 2c 20 22 66 69 65 6c 64 73 22 3a 20 5b 7b 22  >", "fields": [{"<
0000160 74 79 70 65 22 3a 20 22 73 74 72 69 6e 67 22 2c  >type": "string",<
0000200 20 22 6e 61 6d 65 22 3a 20 22 6e 61 6d 65 22 7d  > "name": "name"}<
0000220 2c 20 7b 22 74 79 70 65 22 3a 20 5b 22 69 6e 74  >, {"type": ["int<
0000240 22 2c 20 22 6e 75 6c 6c 22 5d 2c 20 22 6e 61 6d  >", "null"], "nam<
0000260 65 22 3a 20 22 66 61 76 6f 72 69 74 65 5f 6e 75  >e": "favorite_nu<
0000300 6d 62 65 72 22 7d 2c 20 7b 22 74 79 70 65 22 3a  >mber"}, {"type":<
0000320 20 5b 22 73 74 72 69 6e 67 22 2c 20 22 6e 75 6c  > ["string", "nul<
0000340 6c 22 5d 2c 20 22 6e 61 6d 65 22 3a 20 22 66 61  >l"], "name": "fa<
0000360 76 6f 72 69 74 65 5f 63 6f 6c 6f 72 22 7d 5d 7d  >vorite_color"}]}<
0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef  >......GTb.h...B.<
0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42  >$.,.Alyssa.....B<
0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54  >en....red.....GT<
0000460 62 bf 68 95 a2 ab 42 ef 24                       >b.h...B.$<
0000471

Deserialization:

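from avro.datafile import DataFileReader
from avro.io import DatumReader

# The schema is embedded in the data file, so none is passed to the reader
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()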

This outputs:

{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 8, u'name': u'Ben'}

Languages with APIs

Though theoretically any language could use Avro, APIs have been written for several languages, including Java, C, C++, C#, Perl, Python, PHP, and Ruby.

Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.

The original Apache Avro logo was from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company).

The Apache Avro logo was updated to an original design in late 2023.


All our content comes from Wikipedia and under the Creative Commons Attribution-ShareAlike License.

