Pyrobuf: A Cython Alternative to Google's Python Protobuf Library (github.com/appnexus)
77 points by rileyberton on Dec 18, 2015 | 35 comments


Thrift is a very similar system for message serialization to Protobuf.

The thriftpy project implements a pure Python parser for thrift files that dynamically generates Python modules without a build/codegen step. Also has Py2, Py3, and PyPy support.

Awesome library for this sort of thing. Includes a Cython impl of Thrift parsing/serialization, too:

https://github.com/eleme/thriftpy



What is the benefit of using something like Thrift over something like CapNProto or FlatBuffers?


I suspect it comes down to the details of the serialization format.

I know that Thrift and Protobuf were developed / publicly released around the same time (~2008). They both have serialization and RPC approaches. They both have an IDL format and a compiler for codegen in static languages like C++ and Java.

Thrift was adopted by the Apache O/SS community and so it can be found in some Apache projects, e.g. Thrift is used as an interop format in Apache Storm and as a serialization backend supported by Apache Parquet.

Thrift has had a lot of client libraries for different languages come out over the years, so it tends to work everywhere.

IIRC, CapNProto and FlatBuffers were each built as fresh takes on protobuf. So I imagine they are similar, with IDLs and serialization approaches of their own. I took a quick look at CapNProto's Python module documentation and it looks very similar to thriftpy in philosophy, but I don't know its intricate details as I haven't used it.

This guide to Thrift covers all the important details of its schema approach and is meant as a cross-language guide. It should be the official docs, but alas, it's not!

https://diwakergupta.github.io/thrift-missing-guide/


If this gets python 3 support before google's official implementation I would be much more inclined to give it a shot, but it doesn't look like it's compatible yet either.


Try now - I just merged in some changes to make it Python 3 compatible.


If you're looking for a speedy alternative to protobuf then you owe it to yourself to try flatbuffers.

https://google.github.io/flatbuffers/


Don't bother, go directly to cap'n proto: https://capnproto.org


There are a couple of reasons I chose FlatBuffers over Cap'n Proto. First, Python and full C++ don't work on MSVC/Windows (this might have changed recently, especially given the Clang compiler integration into the MS backend).

The second problem is that CapnProto doesn't have structs, i.e. if you want something like struct Vec3 {x: float, y: float, z: float}, in CapnProto you have to either make Vec3 a ref type (+ pointer size and lookup indirection for each instance) or inline it manually (especially annoying given that you must manually number CapnProto fields; maybe this is a benefit in some use cases but it's of no use to me and only creates work).

CapnProto is much bigger as well: it includes its own RPC/distributed object protocol and library, and the serialization part isn't separate; they are in one library. That means taking in a lot of useless code that you would have to figure out if you decided to maintain it.

Also worth noting: FlatBuffers on Python are actually not that fast because they are pure-Python. There are Cython versions in the PR (at least last time I checked). This doesn't matter in my use case but is worth pointing out. And the Python API is very unpythonic (a C-style API with even the naming completely off from PEP8) compared to CapnProto's, which is excellent!

I needed a cross language typed serialization format for binary files and FlatBuffers is better for that than CapnProto IMO.


> the serialization part isn't separate they are in one library

Although Cap'n Proto's C++ implementation is hosted in a single git repository, it does compile to several distinct libraries. You can use just the core serialization/deserialization part, libcapnp, if that's all you need. There are separate components, libcapnpc and libcapnp-rpc, for dynamic reflection and the object-capability remote procedure call system.


Let's play Cap'n Proto documentation or Disney film catchphrase! 1. "If the pattern holds, there will be an infinite number of releases before the end of this month." 2. "Infinity Faster!" 3. "To infinity and beyond!" ɹɐǝʎʇɥƃıן zznq s,ʎǝusıp sı Ɛ :ɹǝsʍu∀


Last I looked cap'n proto didn't have any msvc support.


The last release added it (partially). "The core serialization functionality sufficient for 90% of users is available, but reflection and RPC APIs are not."


So the Python API still doesn't work on Windows? (AFAIK it's using reflection bindings from C++?)


Having more protobuf implementations is certainly good; Google's have been really lacking for us.

It would be really nice if there was a pure Python implementation that didn't use tons of unnecessary metaprogramming.


It's really interesting to me that the top two comments on this story both wish Protocol Buffers were different, but seem to be asking for opposite things.

Your comment wishes that there was a pure-Python implementation that doesn't do metaprogramming. That would imply doing code generation, but having the generated code be more "concrete", so-to-speak (containing all of the actual implementation).

However, a different comment (https://news.ycombinator.com/item?id=10762101) admires the thriftpy project which doesn't have a codegen step at all. This is in some sense the opposite of what you want: everything becomes metaprogramming.

I work on Protocol Buffers at Google, and I frequently observe these kinds of opposing feelings about how protobufs ought to work. One thing I've learned is how incredibly difficult it is to please everybody. Another good example of this is whether to be pure-Python: some people (like you) want that. But it's nearly impossible to make it very fast, and lots of users are sensitive to speed (you can find various comments and blog articles complaining about the speed of Python protobuf).


I've certainly observed the same I think :) and yeah pleasing everyone is impossible -- I think though there's a somewhat coherent way to put at least part of those two things together.

Not wanting code generation might come from people who don't like the additional build step and artifact deployment. I don't like that either, but what I mean by "less metaprogramming" is that the code that's generated currently makes copious use of descriptors, metaclasses, and other "complex" things in a way that makes it extremely JIT-unfriendly on PyPy. We had to completely abandon it in favor of using protoc-c wrapped via CFFI. So I care about speed (almost entirely -- protobuf powers systems for us that do ~350K messages per second).

(I do though think it would be possible to satisfy both complaints simultaneously. I spent about 5 minutes trying to write a codegen-less, pure Python implementation: https://github.com/Julian/pb/tree/master/pb but I didn't find the Proto3 docs mature enough to figure out the binary protocol at the time. Not sure there's enough there to see what direction I was trying to go in, but I have given this a shot before).

And thanks for the reply, I do agree that there are some conflicting concerns involved here.


Ah, that's very interesting -- your desire for more "pure" Python comes from a desire for speed. (Some people who wish the generated code was more concrete want this so the generated code is more readable).

I am curious what led you to the conclusion that descriptors and metaclasses were making Python protobuf JIT-unfriendly. I ask because most of the "meta" stuff happens at startup, when you first import the module. It uses a metaclass to generate a bunch of Python methods, but after that they are just regular Python methods, and should be as easy for PyPy to optimize as anything else. There is a bit of reflection happening in some code-paths (like the __init__ method does loop over a list of fields to decide how to initialize them), but the metaclass at least is pretty much gone after import.

For this reason, I suspect that even if we ditched all the metaclass stuff, you'd see PyPy performance pretty close to what you're seeing now. The only thing I could see potentially making a more significant difference is if there were generated parsing code that switches on field number, instead of looking up fields by number in a dictionary. But given that Python doesn't have an actual switch statement, this might not be an improvement at all.
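
To make the dictionary-lookup approach concrete, here is a minimal sketch of a parser that decodes each tag and looks the field up by number in a dict. The schema table, field names, and function names are hypothetical and this is not the code in Google's library; it only handles varint fields and assumes the standard protobuf wire format (tag = field number << 3 | wire type).

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


# Hypothetical schema: field number -> attribute name (varint fields only).
FIELDS_BY_NUMBER = {1: "foo", 4: "count"}


def parse(buf):
    """Parse a buffer of varint fields, dispatching on field number via a dict."""
    values = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(buf, pos)
        name = FIELDS_BY_NUMBER.get(field_number)
        if name is not None:  # fields not in the schema are skipped
            values[name] = value
    return values


# Field 1 = 150, field 4 = 2.
parsed = parse(bytes([0x08, 0x96, 0x01, 0x20, 0x02]))
```

A generated "switch" version would replace the dict lookup with a chain of if/elif comparisons on field_number, which is why it may not be any faster in Python.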


Hi

The python code that's generated is ok, but the problem is that everything is very dynamic. If you want this to work fast on PyPy, it should really generate a bunch of getters and setters that use attribute access and not go through generic __getattr__ that does some dictionary lookups. Additionally a lot of code is written in a way that creates a lot of temporary lists iterating over all fields and double checking stuff - this is bound to be way slower than a simple C/Java stuff that does the very simplest "check type - attribute access" sort of stuff.
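
A rough sketch of the difference, using two hypothetical classes (neither is the actual generated protobuf code): the first routes every field read through a generic __getattr__ and a dictionary lookup, the second exposes plain attributes that a JIT can treat like any ordinary object.

```python
class DynamicMessage:
    """Generic style: field reads fall through to __getattr__ and a dict lookup."""

    _fields = ("foo", "bar")

    def __init__(self, **kwargs):
        self._values = {name: kwargs.get(name) for name in self._fields}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for field names.
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name) from None


class ConcreteMessage:
    """Concrete style: one real attribute per field, no indirection."""

    __slots__ = ("foo", "bar")

    def __init__(self, foo=None, bar=None):
        self.foo = foo
        self.bar = bar


dynamic = DynamicMessage(foo=1, bar=2.5)
concrete = ConcreteMessage(foo=1, bar=2.5)
```

Both objects answer `.foo` and `.bar` the same way; only the second does so without running Python-level lookup code on each access.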


I think we learned a long time ago that though codegen is one path to speedy code, it is not the only path.

The thriftpy module dynamically generates a Python module that is based on parsing the thrift schema. That happens once: at module import time. After that, Python has cached the module in sys.modules and it doesn't need to be evaluated again during the program's runtime.

The module contains efficient Python classes that are Python bytecode just the same way "compiled" classes would be. But thriftpy also has Cython implementations of the Thrift protocols and transports that compile down to C, not Python bytecode. Thus, the thriftpy impls are not only more dynamic and less cumbersome than the codegen alternative, but they are also faster.

In the Python community, we prefer to take other approaches to achieve execution speed.

Dynamically generating a Python module is not "metaprogramming". It is not black magic. You can generate a module yourself simply by instantiating a Python module type. Hooking the import statement is an officially supported part of the language, and done by many libraries.

This may all seem very worrying to a C++ or Java programmer, but in the Python community we have been doing dynamism with execution speed for about 10 years now, and we haven't regretted any of the results!
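
For anyone who hasn't seen it, a minimal sketch of building a module at runtime and registering it in sys.modules. This is not thriftpy's code; make_module, the demo_schema name, and the field list are made up for illustration, and a real library would derive the fields from a parsed schema file.

```python
import sys
import types


def make_module(name, fields):
    """Build a module holding one generated class and register it for import."""
    module = types.ModuleType(name)

    def __init__(self, **kwargs):
        for field in fields:
            setattr(self, field, kwargs.get(field))

    # type() builds an ordinary class; after this point it behaves like
    # any class written by hand.
    module.Test = type(
        "Test", (object,), {"__init__": __init__, "__slots__": tuple(fields)}
    )
    sys.modules[name] = module
    return module


make_module("demo_schema", ["foo", "bar"])

import demo_schema  # found in sys.modules, so nothing is generated again

msg = demo_schema.Test(foo=1, bar=2.0)
```

The generation cost is paid once per process; later imports of demo_schema anywhere in the program get the cached module object.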


For what it's worth, I prefer the approach you describe. I have advocated for it for a long time. I've been writing C extensions to accelerate parsing in dynamic languages for 10 years, and I'm very familiar with this sort of dynamism.

It's interesting though that you describe this approach as more Pythonic, because one of the most outspoken critics of this approach is a hard-core Python guy that I work with who has lots of experience with the Python ecosystem. He feels very strongly that it is more Pythonic to have very flat/concrete generated code that is transparent to the reader, and really does not like the idea of hooking import and generating everything at runtime.

This is exactly what I am describing about how it is hard to please everybody. Different people appear to have fiercely different opinions about what is idiomatic.

When we wrote the Ruby protobuf implementation this year, we took an approach much more along the lines of what I personally prefer. The extension is mostly implemented in C, and it's very easy to build types at runtime. It doesn't directly import .proto files (which I would have preferred) because there was still some desire from others to have some kind of code generation. So the approach we took was to use a Ruby-like DSL for describing the protobuf schema. But really this is nothing but a translation of the .proto file into the Ruby DSL. ie. take the .proto file:

  syntax = "proto3";

  message Test {
    int32 foo = 1;
    double bar = 2;
    Test test = 3;
  }

The "generated code" for this is simply:

  require 'google/protobuf'
  
  Google::Protobuf::DescriptorPool.generated_pool.build do
    add_message "Test" do
      optional :foo, :int32, 1
      optional :bar, :double, 2
      optional :test, :message, 3, "Test"
    end
  end

  Test = Google::Protobuf::DescriptorPool.generated_pool.lookup("Test").msgclass


Hey Josh! Jorge from gRPC here. What's your take on Pyrobuf?


Reduce your objects and validate using Marshmallow, de(en)code however you like.


Is support for python worse than go or clojure? It seems unimaginable.


why do we keep reinventing the wheel and not use asn.1?


If you think asn.1 is the same as protocol buffers, you are missing the point.

Yes, asn.1 is the all singing, all dancing serialization framework. Protobufs aren't. That isn't by accident.


The spec is too big, and it's too hard to implement correctly (let alone efficiently).


you are using it all the time without realizing it. http://www.marben-products.com/asn.1/market.html

protobuf, swift, etc happened when an advertising company thinks it is an engineering company


Perhaps they should do more marketing, because speaking as a C++ developer: I picked between protobufs, flatbuffers and Cap'n Proto because they were easy to use, had active communities, and they had websites which explained how and why I should use their compiler/library/protocol.

When I search the web for information about asn.1, I find very little that is of practical use. What library should I use? Why should I use it? How does it benchmark in comparison to the other tools? I've seen a few asn.1 library webpages and they all seem to take it as a given that I'm just looking for some way to deal with asn.1 data. They don't bother to try to convince me that their tool is efficient, or even that asn.1 is the right choice for my data in the first place.


you are dismissing an old and proven technology, one that has been the backend to the global telecom system (ss7) for the past two decades, because you feel it is not marketed to you effectively? why is there a movement in computer science to throw out old things that work? this is confusing to me, like nosql. and "good marketing" is something called cap'n proto, which sounds like a joke?

if you are curious here is a good blog post https://ttsiodras.github.io/asn1.html

i am working with asn.1 in the MMS protocol


Well, the only location many programmers nowadays stumble over ASN.1 is in certificates. And ASN.1 parsing from them has a history of massive security issues and results in massive warnings of "do not touch!".

And yes, "good marketing" in the form of readily available and well-documented libraries for the languages we use is a very important factor.

I bet the telecom sector has their battle-tested libraries for ASN.1, or at least for the parts they use. Are they open-source? Are they available for all languages wanted? No? Then why would I use ASN.1, just to use a "standard", if it means using worse code or writing it myself?


More broadly, he is talking about the community. That is what makes or breaks a technology today, and the community for asn.1 doesn't exist. It also lacks many of the features protobufs have, like versioning or associative maps, not to mention it is ugly.


Sorry to disappoint, but I don't really consider myself a computer scientist. My degree was EE with a focus on communications. I then spent five years shuffling bits between custom protocols for unmanned vehicles.

The data format isn't hard. It doesn't particularly impress me that asn.1 has worked for 20 years when I've seen hand-rolled formats do the same. The hard part of the process is defining the actual messages anyway.

The part that is valuable is the library and the tooling that makes efficient serialization easy. If it's not easy, it's not useful because I already know how to do it the hard way.

The marketing has a point, by the way. It's an indication that there are people who are willing to put in effort to get others to adopt their tooling. That's a good indication that when I have a problem, there will be somebody else who's willing to put in effort to help me. Given that I have zero budget to pay for support, that's important.

Thanks for the link, though. There's definitely some useful information there.


ASN.1 is pretty popular for introducing bugs; Estonia recently had to bin hundreds of thousands of ID cards due to an encoding mistake.

Integers being encoded as signed is a nightmare for cryptography, which makes no use of them.
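
To illustrate the signed-integer point, a small sketch of DER INTEGER encoding for non-negative values. The function name is made up and this only handles short-form lengths; it assumes the standard DER rule that INTEGER contents are big-endian two's complement, so any value whose top bit is set needs an extra leading zero byte.

```python
def der_encode_integer(value):
    """DER-encode a non-negative INTEGER (tag 0x02, short-form length only)."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    # One spare bit is always reserved for the sign, which is what forces
    # the leading 0x00 byte when the top bit of the magnitude is set.
    length = value.bit_length() // 8 + 1
    content = value.to_bytes(length, "big", signed=True)
    if len(content) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x02, len(content)]) + content


small = der_encode_integer(127).hex()   # "02017f"
padded = der_encode_integer(128).hex()  # "02020080", note the extra 00
```

Code that forgets the padding byte, or forgets to strip it when reading a key or signature component back, ends up with a value of the wrong sign or the wrong length.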


I don't think ASN.1 supports disjoint types, so that is a decent argument for capn proto's continued existence at least.



