在分析狮中做攻城狮 系列文章
这个系列我会将在我在分析部分作为研发工作的日常,技术和思考发出来,供大家参考.

Why

为什么使用Protocol Buffers, 思考这个问题:
How do you serialize and retrieve structured data?

  1. Use Python pickling. This is the default approach since it’s built into the language, but it doesn’t deal well with schema evolution, and also doesn’t work very well if you need to share data with applications written in C++ or Java.
  2. You can invent an ad-hoc way to encode the data items into a single string – such as encoding 4 ints as “12:3:-23:67”. This is a simple and flexible approach, although it does require writing one-off encoding and parsing code, and the parsing imposes a small run-time cost. This works best for encoding very simple data.
  3. Serialize the data to XML. This approach can be very attractive since XML is (sort of) human readable and there are binding libraries for lots of languages. This can be a good choice if you want to share data with other applications/projects. However, XML is notoriously space intensive, and encoding/decoding it can impose a huge performance penalty on applications. Also, navigating an XML DOM tree is considerably more complicated than navigating simple fields in a class normally would be.

What

Protocol buffers are the flexible, efficient, automated solution to solve exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded with the old format.

How

Simple - Defining Protocol Format

tutorial.proto

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
package tutorial;

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;

enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}

message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}

repeated PhoneNumber phones = 4;
}

message AddressBook {
repeated Person people = 1;
}

men.proto

1
2
3
4
5
import "tutorial.proto"

message men{
required Person person = 1;
}

说明:

  • Field modifiers:

    • required
    • optional
    • repeated
  • Scalar Value Types (See the table below for details)

.proto Type Notes C++ Type Java Type Python Type[2] Go Type
double double double float *float64
float float float float *float32
int32 Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. int32 int int *int32
int64 Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. int64 long int/long[3] *int64
uint32 Uses variable-length encoding. uint32 int[1] int/long[3] *uint32
uint64 Uses variable-length encoding. uint64 long[1] int/long[3] *uint64
sint32 Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. int32 int int *int32
sint64 Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. int64 long int/long[3] *int64
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 228. uint32 int[1] int *uint32
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 256. uint64 long[1] int/long[3] *uint64
sfixed32 Always four bytes. int32 int int *int32
sfixed64 Always eight bytes. int64 long int/long[3] *int64
bool bool boolean bool *bool
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode[4] *string
bytes May contain any arbitrary sequence of bytes. string ByteString str []byte

[1] In Java, unsigned 32-bit and 64-bit integers are represented using their signed counterparts, with the top bit simply being stored in the sign bit.

[2] In all cases, setting values to a field will perform type checking to make sure it is valid.

[3] 64-bit or unsigned 32-bit integers are always represented as long when decoded, but can be an int if an int is given when setting the field. In all cases, the value must fit in the type represented when set. See [2].

[4] Python strings are represented as unicode on decode but can be str if an ASCII string is given (this is subject to change).

  • 赋值的1,2,3 为 unique numbered tag

    • Tags in the range 1 through 15 take one byte to encode
    • Tags in the range 16 through 2047 take two bytes
    • The smallest tag number you can specify is 1, and the largest is 229 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber)
  • Defauil Value [ default = HOME ]

Compiling Protocol buffers

安装编译器

Github Release下载最新release, 推荐使用c++ compiler

安装:

1
2
3
4
5
$ ./configure
$ make
$ make check
$ sudo make install
$ sudo ldconfig # refresh shared library cache.

测试:

1
2
$ protoc 
Missing input file.

编译文件

1
2
3
$ protoc --python_out=. tutorial.proto 
$ ls
tutorial_pb2.py tutorial.proto

The Protocol Buffer API

Standard Message Methods

  • IsInitialized(): checks if all the required fields have been set.
  • str(): returns a human-readable representation of the message, particularly useful for debugging. (Usually invoked as str(message) or print message.)
  • CopyFrom(other_msg): overwrites the message with the given message’s values.
  • Clear(): clears all the elements back to the empty state.

Parsing and Serialization

  • SerializeToString(): serializes the message and returns it as a string. Note that the bytes are binary, not text; we only use the str type as a convenient container.
  • ParseFromString(data): parses a message from the given string.