Differential privacy of example. #1
This behavior is due to a flag in the example configuration. In short, the flag allows elastic sensitivity to use constraints of the data model in order to eliminate invalid candidates from the set of neighboring databases that must be considered in the sensitivity calculation. We describe this optimization in our paper in Section 3.6 ("Using Data Models for Tighter Bounds on Sensitivity"). If you change the flag, the optimization is disabled.
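To see why such a constraint matters for sensitivity, here is a minimal sketch (not the tool's actual implementation) of the worst-case change in a `COUNT(*)` over `customers JOIN orders` when one protected customer record is added or removed. The function name and signature are illustrative assumptions:

```python
from collections import Counter

def count_join_sensitivity(join_keys, key_is_unique=False):
    """Worst-case change in COUNT(*) over a join when one protected
    record is added to or removed from the joined-against relation.

    join_keys: the join-key column of the other relation. Without a
    uniqueness constraint, one record can match up to the maximum key
    frequency; with a unique-key constraint, it matches at most one row.
    """
    if key_is_unique:
        return 1
    freqs = Counter(join_keys)
    return max(freqs.values(), default=0)

# One customer whose key appears 100,000 times in `orders`:
orders_customer_ids = ["FRANK"] * 100_000
print(count_join_sensitivity(orders_customer_ids))                        # 100000
print(count_join_sensitivity(orders_customer_ids, key_is_unique=True))    # 1
```

Under the data-model constraint, the sensitivity of the join drops from the maximum key multiplicity to 1, which is why joining on a constrained key incurs no penalty.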
Thanks! I appreciate the information. I'm reading your text in Section 3.6, and I still can't square it with differential privacy.

In this case I'm not worried about an adjacent dataset with duplicate keys. For example, let's think about how well protected the fact of my presence is.
The optimization here actually requires slightly stronger assumptions: that the key is unique and the database is append-only; the scenario you described therefore falls outside the set of valid adjacent databases. This variant of the optimization was informed by several production systems at Uber with exactly these properties (e.g. logging systems, event streams), where records are only added and each protected record is identified by a unique key. Our code implements a flag for enabling this optimization in those situations where it applies; when enabled, it effectively allows analysts to join on unique keys from these records without being penalized for it. The example query is admittedly not an ideal showcase for the optimization -- as you correctly point out, in a typical setting customers may be added or deleted, and the optimization would thus be turned off in a real deployment. To avoid confusion we'll remove the optimization in the example code and rename the flag to clarify the assumptions being made.
Thanks again! I'm afraid this still doesn't line up with differential privacy. Even under your append-only assumption, the "adjacent" datasets you need to consider involve record removal as well as introduction. "Adjacent" does not mean "datasets that may result from further interaction"; it means "hypothetical alternatives that I might worry you will distinguish between to others". Furthermore, differential privacy also doesn't really have a provision for "assumptions about the data", like that there is at most a certain multiplicity of certain keys. I understand that you might have those constraints in your production systems, but if you just ignore datasets violating these constraints you end up with a weaker privacy guarantee than differential privacy. In particular, you lose the "large step" corollary.
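For reference, the "large step" corollary (group privacy) takes the standard form: for any $\varepsilon$-differentially-private mechanism $M$, any set of outcomes $S$, and any datasets $D_1, D_2$ at symmetric-difference distance $d(D_1, D_2)$,

```latex
\Pr[M(D_1) \in S] \;\le\; e^{\varepsilon \, d(D_1, D_2)} \cdot \Pr[M(D_2) \in S].
```

This follows by chaining the single-record $\varepsilon$-DP inequality along any path of additions and removals between $D_1$ and $D_2$; if some intermediate datasets are declared "invalid" and excluded, the chain breaks and the bound no longer holds.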
If your mechanism doesn't satisfy this property, it doesn't have differential privacy. I think it is a fair question to ask about the topology of the space of datasets, and to allow the distance between two datasets to be larger than their symmetric difference if the shortest path goes through infeasible space, but you would be getting pretty exotic here, and differential privacy is pretty bound up in symmetric difference. What you can do, which is what PINQ does, is to have computations enforce the constraints. From arbitrary data, you can take the first 100 orders for each customer, ensuring that there are at most 100 orders per customer. However, this introduces a factor of 2 in the privacy analysis (adding or removing an order could result in both the addition and removal of an order among those kept).
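The truncation step described above can be sketched as follows (a hypothetical `bound_multiplicity` helper for illustration, not PINQ's actual API), including a small check of where the factor of 2 comes from:

```python
from collections import defaultdict

def bound_multiplicity(records, key, k):
    """Keep at most the first k records per key, so any one key
    contributes at most k rows to downstream aggregates."""
    kept, seen = [], defaultdict(int)
    for r in records:
        if seen[key(r)] < k:
            seen[key(r)] += 1
            kept.append(r)
    return kept

# 101 orders for one customer, truncated to 100. Removing one kept
# order lets a previously-truncated order in, so the truncated
# outputs differ in up to 2 records -- hence the factor of 2.
orders = [("frank", i) for i in range(101)]
a = bound_multiplicity(orders, key=lambda r: r[0], k=100)
b = bound_multiplicity(orders[1:], key=lambda r: r[0], k=100)
print(len(set(a) ^ set(b)))  # 2
```

Because one change upstream can change at most two records downstream, noise calibrated after truncation must account for a symmetric difference of 2 rather than 1.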
Thanks for the clarification. This feature has been removed in 702578a.
Cool, thanks. Don't forget that the …
I may be misunderstanding, but the `com.uber.engsec.dp.util.DPExample` example you have doesn't seem to correspond to differential privacy.

Obviously, it is hard to tell because `QUERY_RESULT` is just fabricated here, but if we are expected to view this as "run the query; imagine you got these results", it does not appear to have the property of differential privacy. It looks like you are assuming that `orders.customer_id` has maximum multiplicity one. Otherwise I could have a `customers` relation with one customer, `FRANK`, and an `orders` relation with 100,000 orders for each of my collection of Ted Cruz sex dolls. The reported results, all of which are tightly concentrated around 100,000, would reveal that I am present in the `customers` relation, whereas had I withheld my record from `customers` the answers would be tightly concentrated around zero.

I'm not sure how the rest of your pipeline works, but perhaps you could clean up the example or explain the rationale for why it has differential privacy for the stated query.

EDIT: Oho! You do have assumptions about the data, laid out in https://github.com/uber/sql-differential-privacy/blob/master/src/test/resources/schema.yaml. This would be super helpful to point out! :D What happens when these assumptions are violated?

EDIT2: It looks like in the `orders` relation the `customer_id` field has `maxFreq: 100`. So let's assume I order 100 Ted Cruz blow-up dolls. My presence or absence in `customers` will change the output by `+/-100`, which it looks like would not be masked by your noise addition; when I run it, the results seem to be clearly 100,000 and certainly not 99,900, which is what the result would be without my hypothetical record. What am I missing?