Sherpa.ai unveils private VFL alignment
- Sherpa.ai researchers posted an April 2026 paper describing a multi-party private set union protocol for vertical federated learning entity alignment without exposing overlaps.
- The method adds two variants (order-preserving exact matching and unordered noisy matching), and the paper says it scales with low communication overhead.
- The work targets banks, hospitals and telecoms that need shared training without revealing shared customers or patients. (arxiv.org)
Vertical federated learning lets companies train one model on different columns of the same people’s data, and Sherpa.ai says it has a new way to match those records without exposing who overlaps. (arxiv.org)

The paper, posted to arXiv in April 2026 by Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Georgios Kellaris, Joaquin Del Rio, Oleksii Sliusarenko and Xabi Uribe-Etxebarria, describes a multi-party private set union protocol for that matching step. (arxiv.org)

The matching problem comes before training starts: a bank may have income data, an insurer may have claims data, and both need to know which rows refer to the same person. Sherpa.ai calls that privacy-preserving entity alignment, or PPEA. (arxiv.org) (sherpa.ai)

Most older approaches use private set intersection, which finds only the shared records. The paper argues that step leaks sensitive information because every participant learns which identifiers appear in common. (arxiv.org) (ieeexplore.ieee.org)

Sherpa.ai’s alternative uses private set union instead: parties align over the full union of identifiers, not just the overlap. The stated goal is to hide intersection membership entirely, so a shared customer or patient is not revealed by the alignment process itself. (arxiv.org) (sherpa.ai)

The paper gives two versions of the protocol. One preserves order for exact matches, and the other is unordered so it can tolerate typos and formatting differences in names or other identifiers. (arxiv.org)

Sherpa.ai says the protocol extends beyond the two-party setups common in earlier work. The authors also say they prove correctness and privacy, and analyze both communication cost and exponentiation cost. (arxiv.org)

The company’s blog points to use cases in multi-institution healthcare, bank-insurer risk modeling and telecom-finance fraud detection. Those are settings where organizations may want a shared model but cannot reveal raw records or even the fact that a person appears in both datasets. (sherpa.ai 1) (sherpa.ai 2)

The paper is a research release, not a disclosed production deployment or customer launch. But it puts a specific design on a hard part of federated learning that has often been treated as a preprocessing detail rather than a privacy leak of its own. (arxiv.org)
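For readers who want to see the distinction concretely, the contrast between intersection-based and union-based alignment can be sketched in a few lines of Python. This is a plaintext toy, not the paper's protocol: it contains no cryptography, and the party names, identifiers and helper functions are all hypothetical. It only illustrates what kind of result each approach hands back. In a real deployment the identifiers would presumably be blinded rather than visible in the clear as they are here.

```python
# Toy illustration only: no cryptography, not the protocol from the paper.
# Party names, identifiers and function names are hypothetical.

def intersection_alignment(party_ids):
    """Intersection-style result: the output itself is the list of shared IDs,
    so every participant learns exactly who appears in all datasets."""
    shared = set.intersection(*(set(ids) for ids in party_ids.values()))
    return sorted(shared)


def union_alignment(party_ids):
    """Union-style result: one common index over all IDs.

    Each party's view marks only the rows it holds itself; the view says
    nothing about which rows the other parties hold.
    """
    union = sorted(set().union(*party_ids.values()))
    views = {}
    for party, ids in party_ids.items():
        own = set(ids)
        views[party] = [uid in own for uid in union]
    return union, views


parties = {
    "bank": ["u1", "u2", "u3"],
    "insurer": ["u2", "u3", "u4"],
    "telecom": ["u3", "u5"],
}

shared = intersection_alignment(parties)
union, views = union_alignment(parties)

print("intersection output:", shared)
print("union index:", union)
for party, mask in views.items():
    print(party, "own-row mask:", mask)
```

In the intersection case the shared customer is the output. In the union case each party sees a common index and a mask of its own rows, which is the property the paper says it wants to keep hidden membership behind; how the actual protocol enforces that cryptographically is not reproduced here.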