||BREAKFAST GATHERING & NETWORKING
||Welcome Remarks & Introductions
||KEYNOTE - Benjamin Treynor Sloss (Google) - Scaling Challenges in Infrastructure: Single Points of Failure, and Expensive Heuristics
At small scale, both engineering and decision shortcuts are often rational because they are quick and easy to implement. With scale, some shortcuts become risky and expensive. We examine two common shortcuts that become less desirable as scale increases: Single Points of Failure, and Decision Heuristics incorporating Zero and Infinity.
SERVICES, PROGRAMMING MODELS
- Marius Eriksen (GRAIL) - Simplicity in the Age of Clouds
Cloud computing offers powerful facilities that make for easy systems building: practitioners can solve a wide variety of problems by integrating existing infrastructure, with only a little application-specific code. The appeal of this approach is clear: we can build systems faster, using proven components that are usually operated by someone else.
However, we should also be careful: when all we have is a cloud, everything looks like it can be solved with 5 different services, 3 programming languages, some serverless computing, an object store, and an event stream. It’s all too easy to end up with systems that are far more complicated than they should be.
In this talk, we’ll explore the design of Reflow, a cloud data processing system. Simplicity and parsimony are core to Reflow’s design, and we’ll talk about how we exploited these principles to create a powerful distributed data processing system, with a relatively small codebase and very few dependencies. While the talk is specifically about Reflow, most of its lessons are readily generalizable.
- Norman Maurer (Apple) - Scaling Netty-based Services @ Apple
Netty underpins most network frameworks on the JVM and is also used directly by JVM services that provide network connectivity. In many of these cases, including at Apple, Netty needs to operate at massive scale and plays a critical role in the service infrastructure.
To operate Netty-based services at Apple's scale, we invest heavily in Netty's development. Some of the changes we have made over the past few years dramatically improved the performance and lowered the latency of Netty-based solutions, including Cassandra and web frameworks built on top of Netty.
This talk will demonstrate the techniques we use in Netty to achieve low latency and high throughput, and review how we use Netty to scale our services and therefore our infrastructure, including what worked and lessons learned while making performance and scalability improvements.
- Peter Sbarski (A Cloud Guru) – Serverless in Practice
A Cloud Guru built one of the largest platforms running on serverless technologies like AWS Lambda (we don’t run a single server, anywhere). It has been an incredible journey, and we learnt a lot of lessons scaling our platform to 600,000+ users.
In this session we’ll share notes from our experience including:
- What our serverless system looks like 2 years in (including key metrics)
- Important design patterns and architectures
- Common serverless mistakes and how to avoid them
We will dive into the design of our platform and share interesting data, go through key patterns and architectures, and discuss what it actually takes to build scalable, reliable, and high-performing cloud-native serverless systems today.
- Andrew Wilcox (Twitter) – Modern Compression at Twitter
The venerable zlib library has been the workhorse of general purpose compression for nearly three decades, offering reasonable throughput and compression ratio. Recent advances have led to new compression codecs with improved characteristics. We discuss the applications of ZStandard (Facebook) and Brotli (Google) at Twitter. We also discuss our experience with optimization of zlib and LZ4 via vectorization.
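The abstract's central trade-off, spending more CPU for a better compression ratio, can be sketched with Python's standard-library zlib. This is illustrative only, not Twitter's code; ZStandard and Brotli expose analogous level knobs but require third-party bindings.

```python
import zlib

def compression_ratio(data: bytes, level: int) -> float:
    """Return original_size / compressed_size for a given zlib level (1-9)."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)

sample = b"the quick brown fox jumps over the lazy dog " * 200

# Higher levels trade CPU time for a (usually) better ratio.
fast = compression_ratio(sample, 1)
best = compression_ratio(sample, 9)
assert best >= fast  # level 9 compresses this repetitive input at least as well
```

Newer codecs such as zstd widen this trade-off space considerably, offering both faster and denser settings than zlib at comparable levels.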
- Bo Liu (Pinterest) – Build a dynamic and responsive Pinterest on AWS: a system engineer’s perspective
In the past three years, Pinterest has evolved from a state where the majority of product surfaces were powered by offline daily jobs to being fully dynamic and responsive. Over the same period, system traffic also grew substantially due to a combination of user growth and more complex products. To meet the requirements imposed by these changes, we had to evolve the architecture of existing services, better leverage solutions provided by AWS, and build nearly ten brand-new services. In this talk, we will discuss how Pinterest backend systems evolved over the past three years to power fully dynamic and responsive products, and share an overall picture of the current backend systems. We believe some of these systems solve common problems of distributing personalized content in a dynamic and responsive way, and we will discuss their technical details. We will also share lessons learned about how to better leverage AWS solutions to build more efficient and reliable backend systems.
||Exhibits Open - In Parallel to Session
STORAGE & DATA
- Tian-Ying Chang (Pinterest) - Goku: Pinterest’s in-house time series database
Goku is a highly scalable, cost-effective, and high-performance online time series database service. It stores and serves massive amounts of time series data without losing granularity. Goku can write tens of millions of data points per second and retrieve millions of data points within tens of milliseconds. It supports high compression ratios, downsampling, interpolation, and multi-dimensional aggregation. It can be used in a wide range of monitoring tasks, including production safety and IoT, as well as for real-time analytics over time series data.
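The abstract does not describe Goku's internals, but downsampling, one of the features it lists, can be illustrated with a minimal, hypothetical sketch: bucket points by a fixed time width and aggregate each bucket.

```python
from collections import defaultdict
from statistics import mean

def downsample(points, bucket_secs, agg=mean):
    """Group (timestamp, value) points into fixed-width time buckets
    and aggregate each bucket's values (mean by default)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_secs].append(value)
    return sorted((ts, agg(vals)) for ts, vals in buckets.items())

points = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
# 60-second buckets: [0, 60) -> mean(1, 3) = 2.0; [60, 120) -> mean(5, 7) = 6.0
assert downsample(points, 60) == [(0, 2.0), (60, 6.0)]
```

A production engine would also choose the aggregator (min, max, sum, percentile) per query and keep raw data alongside the rollups to avoid losing granularity.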
- Amin Heydari (Box Inc.) - Scaling to an Exabyte Storage System in a Multi-Cloud Environment with Strict Governance Requirements
Modern Enterprise Content Management Systems face unique challenges in securely storing and managing access rights to content that is growing at an exponential rate. Many cloud providers make it possible to store large amounts of content at roughly the same relative cost as traditional solutions. The challenge is how to leverage these cloud providers to efficiently transfer and store content in multiple geographically distributed regions to meet compliance requirements.
This presentation will cover capacity analysis modeling for content migration from datacenter to cloud providers while maintaining the necessary compliance requirements.
- Ed Huang (PingCAP) – Designing Relational Databases in the 2010s: Trends and Methodology
The past ten years have witnessed the rapid development of the Internet and the growing popularity of distributed systems. A new field in distributed computing has emerged: the convergence of traditional relational databases and distributed systems. In this talk, I will use TiDB as an illustration and introduce the trends and methodology behind designing modern databases that are both horizontally scalable and compatible with traditional relational databases.
- Jamie Turner (Dropbox) – Pushing Boundaries in Cloud Storage
Jamie Turner is the Director of Persistent Systems Engineering at Dropbox, where he leads the software platforms that store Dropbox customers’ data. These teams collectively build and run systems which hold many trillions of metadata records and exabytes of file data.
Prior to this management role, he was a Principal Engineer, performing various design, implementation, and technical leadership roles on Magic Pocket, Dropbox's block storage system. Additionally, he started an initiative to significantly reengineer Dropbox’s client-side sync engine.
Before Dropbox, Jamie was Head of Engineering at Bump Technologies, Inc., a mobile startup acquired by Google in 2013. Prior to Bump, he was VP of Information Services at YouGov plc. In addition to industry work, he has authored several open source projects that have accumulated thousands of stars on GitHub, including the Stud SSL proxy.
- Dinesh Joshi (Apache Cassandra) – Need for speed: Boosting Apache Cassandra’s performance using Netty
Apache Cassandra 4.0 includes several enhancements. One of the biggest is the switch from blocking network IO using JDK sockets to non-blocking IO with Netty. As a result, Cassandra has seen gains in performance and efficiency. These gains translate into real-world cost savings and allow Cassandra to scale better. This presentation will take you on a tour of the improvements to Cassandra's network layer (old and new) and help quantify the gains in real-world terms.
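The abstract contrasts blocking JDK sockets with Netty's readiness-based model. As an illustration only (using Python's `selectors` module rather than Netty's Java API), event-loop-style non-blocking IO looks roughly like this:

```python
import selectors
import socket

# A socketpair stands in for a real client connection; an event loop asks
# the selector which sockets are ready, so reads never block a thread.
sel = selectors.DefaultSelector()
server_side, client_side = socket.socketpair()
server_side.setblocking(False)
client_side.setblocking(False)
sel.register(server_side, selectors.EVENT_READ)

client_side.send(b"ping")

received = b""
for key, _ in sel.select(timeout=1):   # yields only sockets ready to read
    received = key.fileobj.recv(1024)  # readiness was signaled, so no block

sel.unregister(server_side)
server_side.close()
client_side.close()
assert received == b"ping"
```

The payoff at scale is the same in both worlds: one event-loop thread can multiplex thousands of connections instead of dedicating a blocked thread per socket.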
- Vinu Charanya (Twitter) - How Twitter built a Framework to improve Infrastructure Efficiency at Scale
Twitter serves over 320 million monthly users and at any given time processes tens of millions of Tweets. It is powered by thousands of microservices that run on our internal cloud platform, which consists of a suite of multi-tenant platform services offering Compute, Storage, Messaging, Monitoring, etc. as a service. The multi-tenant nature of these services makes it extremely difficult to track ownership of resources, effectively forecast capacity, and meter resource utilization and cost. We built Kite, a service lifecycle management system that aims to address these problems. In this talk, I’ll describe how Kite generalizes service ownership tracking across Twitter’s disparate infrastructure and platform services. The ownership models are being extended to help manage resource provisioning, track operational metadata, and handle credential management for services at Twitter. Kite has also enabled engineers at Twitter to better visualize resource utilization and spend, plan capacity, and make informed decisions about investing in features vs. infrastructure efficiency.
PANEL SESSION - Panel Chair: Jennifer Fraser (Twitter) - How to manage failures at scale and lessons learned
The focus of this panel is to discuss the lessons learned, decisions, and investments that the panelists have seen companies make when dealing with the growth, scalability, and reliability of their platforms.
- Arya Asemandar (LightStep)
- Berk D. Demir (Facebook)
- Scott Engstrom (Oracle)
- Sabrina Farmer (Google)
- Jennifer Fraser (Twitter)
SECURITY & TRUST
- Raymond W. Blaine (West Point) - Securing Infrastructure’s Industrial Control Systems (ICS)
The United States Army is a microcosm of the country and has many, if not all, of the same functions and challenges. These extend to cybersecurity and its role in defending infrastructure. As the lead for one of the first cyber protection teams to attempt to tackle infrastructure security for the DoD, I came across significant issues with tools, personnel (team composition, talent, retention, etc.), and typical organizational inertia. My team was able to overcome many of these challenges, but significant obstacles and opportunities remain for the larger community to tackle. The lessons and issues discussed in this talk are relevant to any large organization with a significant physical and logical footprint.
||Exhibits Open - In Parallel to Session
AFTERNOON SESSION (continues)
SOFT SKILLS: CULTURE, EXPERIENCE, SERVICES
- Nitzan Blouin (WeWork) - Five Simple Steps for Changing an Engineering Culture
Changing an engineering culture is one of the biggest challenges an engineering leader faces in her career, whether it’s scaling and hiring, delivering in a hyper-growth setting, migrating infrastructure, or adopting different processes and methodologies. Change is scary, even if it is for the best.
How do you get culture change done day in and day out when the finish line is years away? How do you justify the hefty price tag attached to your vision?
- Here Comes The Sun: Be inspired by your vision. You are bringing to life a promised land where all the things will be solved. I dare say that the differentiating factor between success and failure depends on this step.
- With A Little Help From My Friends: Gather the data. Surveys can sometimes be dry, but they are a great way to start a conversation, especially if your organization has hundreds of engineers. A survey will clarify your understanding of the problem; furthermore, it will highlight the gap between how the organization perceives itself and where it actually is.
- Come Together: You can’t do it alone. If part of your strategy includes hiring then here is your chance to bring in the best partners in crime who’ll execute your strategy. If you are not hiring, then find your allies in the existing organization. Better yet, do both.
- Twist and Shout: Cultural change takes time, and often the closer you get to the finish line, the farther away it seems. Celebrate wins so they will keep you inspired and keep your vision alive.
- Getting Better (All The Time): The tactical implementation of your strategy will change from team to team and over time; one size simply doesn’t fit all. Present the right context to the right audience. Account for changes in the organization and in the tech stack, so your strategy remains relevant.
Let’s make something clear: simple is not easy. Cultural change requires understanding a large organization in a short amount of time, pinpointing root causes, and devising a compelling strategy that is not only tactical but also inspiring. Then there is execution: metrics, arm-wrestling for headcount, process changes, tool adoption, and so forth. The steps in this talk will demystify this complex process, and you will take away actionable ways to change the culture in your own organization.
- Dathan Vance Pattishall (WeWork) – WeWork K8s from nothing to happy customer
Delivering infrastructure to engineers requires more than just great engineering. WeWork’s Service Infrastructure team embraced product-focused processes in order to provide a fantastic Kubernetes experience that engineers love. Learn how we brought cutting-edge deployment practices to the whole organization by interviewing stakeholders, communicating pain points to team members, and using a crawl-walk-run approach, as well as the tradeoffs of delivering an MVP instead of everything at once.
- Randy Shoup (WeWork) – Service Architectures at Scale
Over time, almost all large, well-known web sites have evolved their architectures from an early monolithic application to a loosely-coupled ecosystem of polyglot microservices. This presentation discusses what a healthy microservice ecosystem looks and feels like to the service owner, based on the speaker’s experience building and operating services at Google and eBay.
The presentation starts with the considerations in designing services to interact effectively with clients and dependencies, including interface stability in the face of evolving functionality, as well as some patterns and anti-patterns for service design.
It next describes the service ecosystem itself - how service teams relate to one another, how to encourage -- but not enforce -- standardization, and how to grow an ecosystem without central control.
It concludes by discussing operating services at scale, including the challenges of zero-downtime deployments and data migration.
LINUX, SYSTEMS MANAGEMENT
- Keith Packard (Hewlett Packard Enterprise) – Gen-Z Support for Linux
Gen-Z is an open systems interconnect designed to provide memory-semantic access to data and devices via direct-attached, switched, or fabric topologies. Supporting it will require additional kernel infrastructure, device drivers, and management systems. This presentation will briefly describe the Gen-Z interconnect, outline the management required for Gen-Z to operate, and finally present an architecture for supporting Gen-Z natively in Linux.
- Lohit Vijay & Zhenzhao Wang (Twitter) – Large Scale Event Log Management @ Twitter
Log files consisting of events from different services are a rich source of information for large-scale analytics. Events can be as simple as a log line or as complex as nested structured objects such as Thrift or protocol buffer messages. At Twitter, every service generates events within a particular “category” and publishes them to the Event Log Management framework. The framework aggregates events from the same category into log files, usually stored on a distributed file system such as the Hadoop Distributed File System (HDFS). In this presentation we provide an overview of the Event Log Management framework used at Twitter, highlight its advantages, and compare it with similar frameworks. Twitter services generate trillions of events, with an aggregate size exceeding multiple petabytes of data, every day. The main focus of our efforts has been efficient use of hardware resources, along with the scalability and reliability of the framework. We present our framework as a set of modular components that have scaled to handle event log management at Twitter.
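As a hypothetical sketch of the per-category aggregation described above (the class name and threshold are invented; the real framework batches events into HDFS files), the core idea looks like:

```python
from collections import defaultdict

class EventAggregator:
    """Toy sketch: buffer events per category, then flush each category's
    batch as one 'log file' (a list here stands in for an HDFS file)."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.buffers = defaultdict(list)   # category -> pending events
        self.flushed = defaultdict(list)   # category -> list of batches

    def publish(self, category, event):
        buf = self.buffers[category]
        buf.append(event)
        if len(buf) >= self.flush_threshold:
            self.flushed[category].append(list(buf))
            buf.clear()

agg = EventAggregator(flush_threshold=2)
agg.publish("ads_click", {"id": 1})
agg.publish("ads_click", {"id": 2})   # second event triggers a flush
assert agg.flushed["ads_click"] == [[{"id": 1}, {"id": 2}]]
```

Batching by category keeps each output file homogeneous, which is what makes the downstream files easy to scan for analytics.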
- Ruben Oanta (Twitter) – Deterministic Aperture: An algorithm for non-cooperative, client-side load balancing
Twitter's RPC framework (Finagle) employs non-cooperative, client-side load balancing. That is, clients make load balancing decisions independently. Although this architecture has served Twitter well, it also has its drawbacks. In particular, it results in a mesh network where clients open connections to all backend servers in order to maintain a globally equitable load distribution. This scales poorly with larger clusters for a number of reasons. In this talk, we will dive deeper into the problem space and how we mitigate some of the drawbacks via an algorithm we call "Deterministic Aperture".
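Deterministic Aperture itself maps clients and backends onto a continuous ring; as a simplified, hypothetical sketch of the underlying idea, each client can deterministically pick a small window of the backend list based on its own index, so peers jointly cover all backends without a full mesh:

```python
def deterministic_subset(client_index, backends, aperture=2):
    """Simplified, discrete stand-in for deterministic subsetting: each
    client takes a contiguous window of the shared backend ring, offset
    by its position among peers, without any coordination at runtime."""
    offset = (client_index * aperture) % len(backends)
    return [backends[(offset + i) % len(backends)] for i in range(aperture)]

backends = ["b0", "b1", "b2", "b3"]
assert deterministic_subset(0, backends) == ["b0", "b1"]
assert deterministic_subset(1, backends) == ["b2", "b3"]
# Together, two clients cover all four backends with two connections each.
```

The discrete version shown here leaves load imbalanced when cluster sizes do not divide evenly; the continuous-ring formulation in the talk addresses exactly that by letting windows cover fractional backends.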
- Sandy Strong (Twitter) – Capacity Planning for Twitter’s Adaptive Ad Server
The Ad Server is Twitter’s revenue engine. It performs ad matching, scoring, and serving at an immense scale. The goal for our ads serving system is to serve queries at Twitter-scale without buckling under load spikes, find the best possible ad for every query, and utilize our resources optimally at all times. This talk will provide an overview of capacity planning strategies and lessons learned that have proven to be valuable in achieving these goals.
- Ihab Hamadi (Western Digital) – Driving the Future of Data Infrastructure for High-Scale Data Centers
The exponential growth in data is not only fueling new Big Data and Fast Data applications, it is also creating complexities in the way that data is being captured, preserved, accessed and transformed.
New NVMe™ over Fabrics (NVMe-oF) devices will soon be available from multiple vendors. These new fabric devices will require additional management procedures above and beyond what has been historically used.
The OpenFlex™ architecture disaggregates flash and disk, compute, and network resources into independent, scalable pools that can be connected via common networking technologies such as Ethernet. The open Kingfish™ API enables these pools to be presented as software-composable infrastructure that can be quickly and easily orchestrated to precisely address the needs of complex and dynamic applications and data workflows. Various Kingfish™ tools will also be contributed to open source.
- Prasanth Pulavarthi (Microsoft) – ONNX: Open Ecosystem for Interoperable and Scalable AI
Traditionally, each machine learning framework has been its own silo: data scientists needed to create, train, and deploy models with the framework and its proprietary tools. Often, deploying with the original framework was not practical, and much development effort had to be spent converting models to another framework or format. To remedy this situation, Microsoft co-created the Open Neural Network Exchange (ONNX). ONNX provides a common format that enables frameworks (Caffe2, Chainer, CNTK, MXNet, PyTorch, scikit-learn, TensorFlow) to produce models that can be deployed on a variety of devices and platforms (Azure, Windows, Linux, iOS/Core ML, Android, etc.). ONNX is an open industry effort with participation from many companies.
We will share the motivations for a cross-industry partnership, the current technical state and roadmap for ONNX, how ONNX is supported at Microsoft, and show how to go from training to deployment using ONNX models.
- Natalia Vassilieva (Hewlett Packard Enterprise) – HPE Deep Learning Cookbook: How to Choose an Optimal Infrastructure for Deep Learning Workloads
There is a vibrant and fast-growing ecosystem of software and hardware for deep learning. Various deep learning frameworks are available for anyone who wants to try out this technology for their needs. With the variety of choices in hardware configurations and software packages, it is hard to pick the optimal tools, and the effectiveness of any hardware/software environment varies with the deep learning workload. The HPE Deep Learning Cookbook is a set of tools to guide the choice of optimal hardware/software configurations for any given deep learning application. It is based on extensive benchmarking of reference workloads and on performance modelling. The talk will present the two main components of the cookbook toolset and general learnings about infrastructure requirements and the performance of deep learning applications.
||KEYNOTE - Danny Lange (Unity Technologies) - From Cloud to Cyber-physical Systems, AI, and Beyond
From my first homemade computer to pushing the boundaries of Artificial Intelligence (AI) with the world’s most popular real-time 3D simulation engine, I have been fortunate to always work at the very forefront of computing infrastructure and its application. My work on the Unity platform blurs the line between physical reality and the virtual. Not only does it reach 1.5 billion people using apps built on the Unity engine, but through massive cloud-based simulations we are able to train AI on scenarios that easily exceed humanity’s collective experience in specific areas. Tracing the technologies I have worked on depicts a forward-looking trajectory that carries a lot of promise for creating a better world for all, but it also raises responsibilities that we as engineers need to take seriously. I see computing infrastructure rapidly spreading its tentacles deeper into the physical world while concurrently expanding into the virtual. The integration of large-scale computational and physical components enables revolutionary new scenarios, ranging from self-driving cars to automated farming and healthcare, and even scientific work itself. Cyber-physical systems with AI that eliminate the human-in-the-loop may soon take care of the world we live in. I see the most intriguing technical challenges ahead, but I also see significant issues in the areas of ethics, security, reliability, and dependability of these systems.
||SOCIAL & NETWORKING