Avro Schema

Avro supports both primitive and complex data types.

Primitive data types

null, boolean, int, long, float, double, string, bytes

Complex data types

array – An ordered collection of objects. All objects in the array must be of the same type. To learn how to use arrays, see this blog. (All the complex type examples show how to use the type as a field in a record.)


{
  "name": "superHeros",
  "type": {
    "type": "array",
    "items": "string"
  }
}

map – An unordered collection of key-value pairs. Keys must be strings. Values can be of any type, but all values within a given map must be of the same type.

{
  "name": "phoneModelWeightMap",
  "type": {
    "type": "map",
    "values": "float"
  }
}
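
The "same type" rule for arrays and maps can be illustrated with a small check. This is a minimal sketch in plain Python, not the Avro library's own validation; the `conforms` helper and the sample values are hypothetical and cover only the two schemas shown above.

```python
# Simplified illustration only; this is NOT the Avro library's validator.
# It understands just "string", "float", "array" and "map".
PRIMITIVE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "float": lambda v: isinstance(v, float),
}


def conforms(schema, value):
    """Return True if value fits the given (tiny subset of) Avro schema."""
    if isinstance(schema, str):
        return PRIMITIVE_CHECKS[schema](value)
    if schema["type"] == "array":
        # Every item must match the single "items" schema.
        return isinstance(value, list) and all(
            conforms(schema["items"], item) for item in value
        )
    if schema["type"] == "map":
        # Keys must be strings; every value must match the "values" schema.
        return isinstance(value, dict) and all(
            isinstance(key, str) and conforms(schema["values"], item)
            for key, item in value.items()
        )
    raise ValueError("unsupported schema in this sketch")


super_heros = {"type": "array", "items": "string"}
phone_model_weight_map = {"type": "map", "values": "float"}

print(conforms(super_heros, ["Batman", "Storm"]))
print(conforms(super_heros, ["Batman", 42]))
print(conforms(phone_model_weight_map, {"modelA": 138.5}))
print(conforms(phone_model_weight_map, {"modelA": "heavy"}))
```

The second and fourth calls fail the check because one element does not match the declared item or value type.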

record – A collection of named fields of any type. Check this blog for an example.

enum – A set of named values, similar to a C-style enum.

{
  "name": "weekendDay",
  "type": {
    "type": "enum",
    "name": "WeekendDay",
    "symbols": ["sunday", "saturday"]
  }
}

fixed – A fixed number of unsigned bytes. The size attribute gives the number of bytes in each value.

{
  "name": "creditCardNumber",
  "type": {
    "name": "CreditCardNumber",
    "type": "fixed",
    "size": 16
  }
}
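
To make the constraints on enum and fixed concrete, here is a rough sketch in plain Python. The helper names `conforms_enum` and `conforms_fixed` are made up for illustration and are not part of any Avro API; the schemas are the two shown above.

```python
# Illustrative checks only; the helper names are hypothetical.
weekend_day = {
    "type": "enum",
    "name": "WeekendDay",
    "symbols": ["sunday", "saturday"],
}
credit_card_number = {"type": "fixed", "name": "CreditCardNumber", "size": 16}


def conforms_enum(schema, value):
    """An enum value must be one of the declared symbols."""
    return value in schema["symbols"]


def conforms_fixed(schema, value):
    """A fixed value must be exactly `size` bytes long."""
    return isinstance(value, bytes) and len(value) == schema["size"]


print(conforms_enum(weekend_day, "sunday"))
print(conforms_enum(weekend_day, "monday"))
print(conforms_fixed(credit_card_number, bytes(16)))
print(conforms_fixed(credit_card_number, b"1234"))
```

Here "monday" is rejected because it is not a declared symbol, and `b"1234"` is rejected because it is 4 bytes rather than 16.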

union – If a field can hold values of more than one type, use a union to list the possible schemas. A union is represented by a JSON array.

For example, suppose you want to represent a user's phone number, which may be either a string or null. In that case, use the following schema.

{
  "name": "phoneNumber",
  "type": ["null", "string"]
}
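
One way to picture a union is as a list of branches, where a value is accepted if it matches one of them. The sketch below is plain Python under that assumption; `union_branch` is a hypothetical helper, not an Avro library call, and it handles only the "null" and "string" branches used above.

```python
# Hypothetical helper for illustration; handles only "null" and "string".
BRANCH_CHECKS = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
}


def union_branch(union, value):
    """Return the zero-based position of the first branch that accepts value."""
    for index, branch in enumerate(union):
        if BRANCH_CHECKS[branch](value):
            return index
    raise ValueError("value does not match any branch of the union")


phone_number_type = ["null", "string"]

print(union_branch(phone_number_type, None))
print(union_branch(phone_number_type, "555-0100"))
```

A missing phone number selects the "null" branch and a present one selects the "string" branch; any other value, such as an integer, raises `ValueError` in this sketch.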

For Java and C++, you can generate code to represent the data for an Avro schema.

Every language that implements the Avro API has a mapping between Avro types and the types available in that language.

The Java API supports this mapping in three ways:

Generic mapping

Specific mapping

Reflect mapping

 

Schema resolution

The schema you read with doesn't have to be the same as the schema the data was written with. You can add new fields or remove existing fields.

If a new field is added to the reading schema, you have to specify a default value, which is used when the field is not present in the data.

{
  "name": "Employee",
  "type": "record",
  "doc": "employee records",
  "fields": [{
    "name": "empId",
    "type": "string"
  }, {
    "name": "empName",
    "type": "string"
  }, {
    "name": "position",
    "type": "string",
    "default": "unknown"
  }]
}

The position field is not present in the original schema. Check this blog for an example.
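
The default-value rule can be sketched in a few lines of plain Python. This is only an illustration of the idea on an already-decoded record held in a dict; real Avro readers resolve the writing and reading schemas while decoding the data. The `resolve` function and the sample employee are hypothetical.

```python
# Sketch of the default-value rule; not how the Avro library is implemented.
READER_SCHEMA = {
    "name": "Employee",
    "type": "record",
    "doc": "employee records",
    "fields": [
        {"name": "empId", "type": "string"},
        {"name": "empName", "type": "string"},
        {"name": "position", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, reader_schema):
    """Build the record the reader sees, filling defaults for missing fields."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# Data written with the original schema, which had no position field.
old_record = {"empId": "E1", "empName": "Ada"}
print(resolve(old_record, READER_SCHEMA))
```

Without the "default" entry on position, the same call would raise `ValueError`, which mirrors why a newly added field needs a default.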

Projections: Read only a few of the fields present in the data file. If your input data has many fields, you can define a schema that contains only the fields you need. For example, if you just want to read empName from the employee data, the following schema will work.

{
  "name": "Employee",
  "type": "record",
  "doc": "employee records",
  "fields": [{
    "name": "empName",
    "type": "string"
  }]
}

Check this blog for example code.
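
As a rough picture of what a projection does, the sketch below keeps only the fields named in the projection schema. It is plain Python over a decoded dict, with a hypothetical `project` helper and made-up sample data; it is not the Avro library's reader.

```python
# Illustration of projection on an already-decoded record; helper is hypothetical.
PROJECTION_SCHEMA = {
    "name": "Employee",
    "type": "record",
    "doc": "employee records",
    "fields": [{"name": "empName", "type": "string"}],
}


def project(record, reader_schema):
    """Keep only the fields that the reading schema asks for."""
    wanted = [field["name"] for field in reader_schema["fields"]]
    return {name: record[name] for name in wanted}


full_record = {"empId": "E1", "empName": "Ada", "position": "engineer"}
print(project(full_record, PROJECTION_SCHEMA))
```

The empId and position values are simply ignored, so the caller only sees empName.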
