BREAKFAST GATHERING & NETWORKING
KEYNOTE - Jennifer Fraser (Twitter)
KEYNOTE - Benjamin Treynor Sloss (Google)
SECURITY & TRUST
- Raymond W. Blaine (West Point) - Securing Infrastructure’s Industrial Control Systems (ICS)
The United States Army is a microcosm of the country and has many, if not all, of the same functions and challenges. These functions and challenges extend to cybersecurity and its role in defending infrastructure. As the lead for one of the first cyber protection teams to attempt to tackle infrastructure security for the DoD, I came across significant issues with tools, personnel (team composition, talent, retention, etc.), and typical organizational inertia. My team was able to overcome many of these challenges, but significant obstacles and opportunities remain for the larger community to tackle. The lessons and issues discussed in this talk are relevant to any large organization with a significant physical and logical footprint.
- Stefan Nastic (Reinvent) – A Hybrid-computing Platform for Managing Trustless Interactions In Sharing Economy Systems
Over the last decade, the sharing economy has reached a global scale. Recent technological advances have enabled a widening of the "circle of trust" by increasing the volume and geographical scale of possible interactions among the participants of the sharing economy. Unfortunately, current implementations of sharing economy platforms have several limiting factors, the most significant being the centralization of trust. Centralized trust goes against the inherently decentralized nature of the sharing economy, limiting the possibilities for wider adoption beyond specific domain verticals. In this talk, we provide an overview of the WeValue platform, which is currently being developed by Reinvent – a Vienna University of Technology spin-off company. WeValue is a facilitating platform for the sharing economy that provides the concept of decentralized trust at its core. We present the most important features of the WeValue platform and focus on the core technical challenges of building a hybrid-computing platform, which combines traditional centralized computing infrastructures with blockchain-based decentralized computing infrastructure. Finally, we outline a particular use case of utilizing the WeValue approach to realize resource and data sharing for IoT systems.
STORAGE & DATA
- Garth Booth & Amin Heydar (Box Inc.) - Scaling to an Exabyte Storage System in a Multi-Cloud Environment with Strict Governance Requirements
Modern Enterprise Content Management Systems face unique challenges in securely storing and managing access rights to content that is continuously growing at an exponential rate. Many cloud providers make it possible to store large amounts of content at roughly the same relative cost as traditional solutions. The challenge is how to leverage these cloud providers to efficiently transfer and store content in multiple geographically distributed regions to meet compliance requirements.
This presentation will cover capacity analysis modeling for content migration from datacenter to cloud providers while maintaining the necessary compliance requirements.
- Ed Huang (PingCAP) – Designing Relational Databases in the 2010s: Trends and Methodology
The past ten years have witnessed the rapid development of the Internet and the popularity of distributed systems. A new field in distributed computing emerges: the convergence of the traditional relational databases and the distributed systems. In this talk, I will use TiDB as an illustration and introduce the trends and methodology of how to design modern databases that are both horizontally scalable and compatible with traditional relational databases.
- Jamie Turner (Dropbox)
- Dinesh Joshi (Apache Cassandra) – Need for speed: Boosting Apache Cassandra’s performance using Netty
Apache Cassandra 4.0 includes several enhancements. One of the biggest is the switch from blocking network IO using JDK sockets to non-blocking IO with Netty. As a result, Cassandra has seen gains in performance and efficiency. These gains translate into real-world cost savings and allow Cassandra to scale better. This presentation will take you on a tour of the improvements to Cassandra's network layer (old & new) and help quantify the gains in real-world terms.
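The blocking-to-non-blocking shift described above can be illustrated outside the JVM as well. The sketch below uses Python's asyncio (as a stand-in for Netty, which is a Java library) to show the core idea: one event loop multiplexes many connections instead of parking a dedicated thread on each blocking read. Names here are illustrative, not taken from Cassandra or Netty.

```python
import asyncio

async def handle(reader, writer):
    # "await" yields to the event loop instead of blocking a thread,
    # which is the essence of the non-blocking IO model Netty provides.
    data = await reader.read(1024)
    writer.write(data)  # echo the bytes back
    await writer.drain()
    writer.close()

async def main():
    # Port 0 asks the OS for any free port.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # A client round-trip against our own echo server.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"ping")
    await writer.drain()
    writer.write_eof()
    reply = await reader.read()  # read until EOF

    writer.close()
    server.close()
    await server.wait_closed()
    return reply

print(asyncio.run(main()))
```

The same loop can service thousands of such connections concurrently, which is where the efficiency gains over thread-per-connection blocking IO come from.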
- Tian-Ying Chang (Pinterest) - Goku: Pinterest’s in-house time series database
Goku is a highly scalable, cost-effective, and high-performance online time series database service. It stores and serves massive amounts of time series data without losing granularity. Goku can write tens of millions of data points per second and retrieve millions of data points within tens of milliseconds. It supports high compression ratios, downsampling, interpolation, and multi-dimensional aggregation. It can be used in a wide range of monitoring tasks, including production safety and IoT, as well as for real-time analytics over time series data.
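To make the downsampling feature concrete, here is a minimal Python sketch of the generic technique: averaging raw points within fixed-width time buckets. This illustrates the idea only; it is not Goku's actual implementation, and the function name is hypothetical.

```python
from collections import defaultdict

def downsample(points, bucket_seconds):
    """Average (timestamp, value) points within fixed-width time
    buckets -- a generic downsampling technique, not Goku's own."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return sorted(
        (bucket_ts, sum(vals) / len(vals))
        for bucket_ts, vals in buckets.items()
    )

# One minute of per-second samples reduced to 15-second averages.
raw = [(t, float(t % 4)) for t in range(60)]
print(downsample(raw, 15))
```

Real systems typically also keep min/max/count per bucket so that aggregates remain exact after rollup; the averaging above is the simplest variant.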
PANEL SESSION - How to manage failures at scale and lessons learned
This panel will discuss the lessons learned, decisions, and investments the panelists have seen companies make when dealing with the growth, scalability, and reliability of their platforms.
- Arya Asemandar (LightStep)
- Berk D. Demir (Facebook)
- Sabrina Farmer (Google)
KEYNOTE - Raffi Krikorian (DNC) - Infrastructure building and constitutionally mandated ship dates
Infrastructure building is the highest-leverage thing to do in a political environment. Having this foundation in place allows campaigns to quickly build the specific technology they need to reach their constituencies. But creating this platform is fraught with challenges: supporting incumbent forces and teams, training new staffers, cyclical finances, deferred maintenance, a culture that has been trained to "hero" results, and constitutionally mandated ship dates.
SERVICES, PROGRAMMING MODELS
- Marius Eriksen (GRAIL) - Simplicity in the Age of Clouds
Cloud computing offers powerful facilities that make for easy systems building: practitioners can solve a wide variety of problems by integrating existing infrastructure, with only a little application-specific code. The appeal of this approach is clear: we can build systems faster, using proven components that are usually operated by someone else.
However, we should also be careful: when all we have is a cloud, everything looks like it can be solved with 5 different services, 3 programming languages, some serverless computing, an object store, and an event stream. It’s all too easy to end up with systems that are far more complicated than they really should be.
In this talk, we’ll explore the design of Reflow, a cloud data processing system. Simplicity and parsimony are core to Reflow’s design, and we’ll talk about how we exploited these principles to create a powerful distributed data processing system, with a relatively small codebase and very few dependencies. While the talk is specifically about Reflow, most of its lessons are readily generalizable.
- Norman Maurer (Apple) - Scaling Netty-based Services @ Apple
Netty is the underpinning for most network frameworks for the JVM, or is used directly by services which provide network connectivity and run on the JVM. In many of these cases, including Apple, Netty needs to operate at a massive scale and plays a critical role in the service's infrastructure.
To operate Netty-based services at Apple's scale, we invest heavily in the development of Netty. Some of the changes we have made over the last few years dramatically improved the performance and lowered the latency of Netty-based solutions, including Cassandra and web frameworks built on top of Netty.
This talk will demonstrate different techniques we use in Netty to achieve low latency and high throughput and review how we use Netty to scale our services and therefore our infrastructure; including what worked and lessons learned while making performance and scalability improvements.
- Peter Sbarski (A Cloud Guru) – Serverless in Practice
A Cloud Guru built one of the largest platforms using serverless technologies like AWS Lambda (we don’t run a single server, anywhere). It has been an incredible journey, and we learned a lot of lessons scaling our platform to 600,000+ users.
In this session we’ll share notes from our experience including:
- What our serverless system looks like 2 years in (including key metrics)
- Important design patterns and architectures
- Common serverless mistakes and how to avoid them
We will dive into the design of our platform and share interesting data, go through key patterns and architectures, and discuss what it actually takes to build scalable, reliable, and high-performing cloud-native serverless systems today.
- Andrew Wilcox (Twitter) – Modern Compression at Twitter
The venerable zlib library has been the workhorse of general-purpose compression for more than two decades, offering reasonable throughput and compression ratio. Recent advances have led to new compression codecs with improved characteristics. We discuss the applications of ZStandard (Facebook) and Brotli (Google) at Twitter. We also discuss our experience optimizing zlib and LZ4 via vectorization.
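The ratio-versus-speed tradeoff at the heart of codec selection can be seen directly from a level-based compression API. The sketch below uses Python's stdlib zlib bindings as a stand-in; ZStandard and Brotli expose similar level knobs through their own (third-party) bindings, and the payload here is made up for illustration.

```python
import zlib

# A repetitive payload, standing in for log or timeline data.
payload = b"timeline event payload " * 2000

for level in (1, 6, 9):  # fastest .. default .. best ratio
    compressed = zlib.compress(payload, level)
    ratio = len(payload) / len(compressed)
    print(f"level={level} size={len(compressed)} ratio={ratio:.1f}x")

# Whatever the level, decompression must round-trip exactly.
assert zlib.decompress(zlib.compress(payload, 1)) == payload
```

In practice the choice of codec and level is workload-dependent: hot paths favor fast codecs and low levels, while cold storage favors maximum ratio.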
- Bo Liu (Pinterest) – Build a dynamic and responsive Pinterest on AWS: a system engineer’s perspective
In the past three years, Pinterest has evolved from a state where the majority of product surfaces were powered by offline daily jobs to being fully dynamic and responsive. During the same period, system traffic also greatly increased due to a combination of user growth and more complex products. To make our systems meet the requirements imposed by these changes, we had to evolve the architecture of existing services, better leverage solutions provided by AWS, and build nearly ten brand-new services. In this talk, we will discuss how Pinterest backend systems evolved over the past three years to power fully dynamic and responsive products. An overall picture of the current backend systems will be shared. We believe some of these systems solve common problems of distributing personalized content in a dynamic and responsive way, and we will discuss their technical details. We will also share lessons learned about how to better leverage AWS solutions to build more efficient and reliable backend systems.
AFTERNOON SESSION (continues)
- Nitzan Blouin (WeWork) - Five Simple Steps for Changing an Engineering Culture
Changing an engineering culture is one of the biggest challenges an engineering leader faces through her career; whether it’s scaling and hiring, delivering in a hyper-growth setting, infrastructure migration or adopting different processes and methodologies. Change is scary, even if it is for the best.
How do you get culture change done day in and day out when the finish line is years away? How do you justify the hefty price tag attached to your vision?
- Here Comes The Sun: Be inspired by your vision. You are bringing to life a promised land where all the things will be solved. I dare say that the differentiating factor between success and failure depends on this step.
- With A Little Help From My Friends: Gather the data. Surveys can sometimes be dry, but they are a great way to start a conversation, especially if your organization has hundreds of engineers. A survey will clarify your understanding of the problem; furthermore, it will highlight the gap between how the organization perceives itself and where it actually is.
- Come Together: You can’t do it alone. If part of your strategy includes hiring then here is your chance to bring in the best partners in crime who’ll execute your strategy. If you are not hiring, then find your allies in the existing organization. Better yet, do both.
- Twist and Shout: Cultural change takes time, and often the closer you get to the finish line, the farther away it seems. Celebrate wins so they will keep you inspired and keep your vision alive.
- Getting Better (All The Time): The tactical implementation of your strategy will change from team to team, and over time, one size simply doesn’t fit all. Present the right context to the right audience. Account for changes in the organization, and in the tech stack, so your strategy remains relevant.
Let’s make something clear - simple is not easy. Cultural change requires understanding a large organization in a short amount of time, pinpointing root causes, and devising a compelling strategy that is not only tactical but also inspiring. Then there is execution: metrics, arm-wrestling for headcount, process changes, tool adoption, and so forth. The steps in this talk will demystify this complex process, and you will take away actionable ways to change the culture in your own organization.
- Dathan Vance Pattishall (WeWork) – WeWork K8s from nothing to happy customer
Delivering infrastructure to engineers requires more than just great engineering. WeWork’s Service Infrastructure team embraced product-focused processes in order to provide a fantastic Kubernetes experience that engineers love. Learn about how we brought cutting-edge deployment practices to the whole organization by interviewing stakeholders, communicating the pain points to team members, using a crawl-walk-run approach, and the tradeoffs of delivering an MVP instead of everything at once.
- Julien LeDem (WeWork) – From Flat Files to Deconstructed Database: The Evolution and Future of the Big Data Ecosystem
Over the past ten years, Big Data infrastructure has evolved from flat files in a distributed file system into a more efficient ecosystem, and is turning into a fully deconstructed database.
With Hadoop, we started from a system that was good at looking for a needle in a haystack using snowplows. We had a lot of horsepower and scalability but lacked the subtlety and efficiency of relational databases.
Since Hadoop provided ultimate flexibility compared to the more constrained and rigid RDBMSes, we didn’t mind and plowed through. Machine learning, recommendations, matching, abuse detection, and data-driven products in general require a more flexible infrastructure. Over time we started applying everything that had been known to the database world for decades to this new environment. We’d been told loudly enough that Hadoop was a huge step backwards.
And it was true to some degree. The key difference was the flexibility of the Hadoop stack. There are many highly integrated components in a relational database and decoupling them took some time.
Today we see the emergence of key components (optimizer, columnar storage, in-memory representation, table abstraction, batch and streaming execution) as standards that provide the glue between the options available to process, analyze and learn from our data.
We’ve been deconstructing the tightly integrated Relational database into flexible reusable open source components. Storage, compute, multi-tenancy, batch or streaming execution are all decoupled and can be modified independently to fit every use case.
This talk will go over key open source components of the Big Data ecosystem (including Apache Calcite, Parquet, Arrow, Avro, Kafka, Batch and Streaming systems) and will describe how they all relate to each other and make our Big Data ecosystem more of a database and less of a file system. Parquet is the columnar data layout to optimize data at rest for querying. Arrow is the in-memory representation for maximum throughput execution and overhead-free data exchange. Calcite is the optimizer to make the most of our infrastructure capabilities. We’ll discuss the emerging components that are still missing or haven’t become standard yet to fully materialize the transformation to an extremely flexible database that lets you innovate with your data.
LINUX, SYSTEMS MANAGEMENT
- Keith Packard (Hewlett Packard Enterprise) – Gen-Z Support for Linux
Gen-Z is an open systems Interconnect designed to provide memory semantic access to data and devices via direct-attached, switched or fabric topologies. Support for this will require additional kernel infrastructure, device drivers and management systems. This presentation will first briefly describe the Gen-Z interconnect, outline the management required for Gen-Z to operate and finally present an architecture for supporting Gen-Z natively in Linux.
- Lohit Vijay & Zhenzhao Wang (Twitter) – Large Scale Event Log Management @ Twitter
Log files consisting of events from different services are a rich source of information for large-scale analytics. Events can be as simple as a log line or as complex as nested structured objects such as Thrift or Protocol Buffers messages. At Twitter, every service generates events within a particular “category” and publishes them to the Event Log Management framework. The framework aggregates events from the same category into log files, usually stored on a distributed file system such as the Hadoop Distributed File System (HDFS). In this presentation we provide an overview of the Event Log Management framework used at Twitter, highlight its advantages, and compare it with similar frameworks. Twitter services generate trillions of events with an aggregate size exceeding multiple petabytes of data every day. The main focus of our efforts has been efficient use of hardware resources, and the scalability and reliability of the framework. We present our framework as a set of modular components that has scaled for event log management at Twitter.
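The category-based aggregation described above can be sketched in a few lines: events carry a category, and events of the same category are routed into the same log stream (at Twitter, ultimately a file on HDFS). The names and event shapes below are illustrative only, not Twitter's actual framework.

```python
from collections import defaultdict

def aggregate(events):
    """Group (category, payload) events into per-category log
    streams -- a toy model of category-based event aggregation."""
    logs = defaultdict(list)
    for category, payload in events:
        logs[category].append(payload)
    return dict(logs)

# Hypothetical event stream from several services.
events = [
    ("ads_click", '{"user": 1}'),
    ("login", '{"user": 2}'),
    ("ads_click", '{"user": 3}'),
]
print(aggregate(events))
```

At scale, the interesting work is in everything this sketch omits: buffering, batching, fault-tolerant delivery, and rolling the per-category streams into files on a distributed file system.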
- Ruben Oanta (Twitter) – Deterministic Aperture: An algorithm for non-cooperative, client-side load balancing
Twitter's RPC framework (Finagle) employs non-cooperative, client-side load balancing. That is, clients make load balancing decisions independently. Although this architecture has served Twitter well, it also has its drawbacks. In particular, it results in a mesh network where clients open connections to all backend servers in order to maintain a globally equitable load distribution. This scales poorly with larger clusters for a number of reasons. In this talk, we will dive deeper into the problem space and how we mitigate some of the drawbacks via an algorithm we call "Deterministic Aperture".
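The core idea of deterministic subsetting can be sketched simply: instead of every client connecting to every server, each client deterministically takes a small window of the server list, offset by its position among the clients, so the windows collectively cover all servers evenly with no coordination. This is a simplified illustration of the idea behind the talk, not Finagle's actual algorithm, and all names here are hypothetical.

```python
def aperture(client_index, num_clients, servers, width):
    """Return this client's deterministic subset (its "aperture")
    of the server list: a contiguous window of `width` servers,
    offset by the client's position on the ring."""
    n = len(servers)
    # Spread client windows evenly around the server "ring".
    offset = round(client_index * n / num_clients)
    return [servers[(offset + i) % n] for i in range(width)]

servers = [f"backend-{i}" for i in range(10)]
for c in range(4):
    print(c, aperture(c, 4, servers, 3))
```

Because each client only opens `width` connections instead of one per backend, connection counts stay bounded as clusters grow, while the even spacing keeps aggregate load roughly uniform across servers.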
- Sandy Strong (Twitter) – Capacity Planning for Twitter’s Adaptive Ad Server
The Ad Server is Twitter’s revenue engine. It performs ad matching, scoring, and serving at an immense scale. The goal for our ads serving system is to serve queries at Twitter-scale without buckling under load spikes, find the best possible ad for every query, and utilize our resources optimally at all times. This talk will provide an overview of capacity planning strategies and lessons learned that have proven to be valuable in achieving these goals.
- Prasanth Pulavarthi (Microsoft) – ONNX: Open Ecosystem for Interoperable and Scalable AI
Traditionally, each machine learning framework has been its own silo: data scientists needed to create, train, and deploy models with that framework and its proprietary tools. Often, deploying with the framework was not practical, and significant development effort had to be spent converting models to another framework or format. To remedy this situation, Microsoft co-created the Open Neural Network Exchange (ONNX). ONNX provides a common format that enables frameworks (Caffe2, Chainer, CNTK, MXNet, PyTorch, scikit-learn, TensorFlow) to produce models that can be deployed on a variety of devices and platforms (Azure, Windows, Linux, iOS/CoreML, Android, etc.). ONNX is an open industry effort with participation from many companies.
We will share the motivations for a cross-industry partnership, the current technical state and roadmap for ONNX, how ONNX is supported at Microsoft, and show how to go from training to deployment using ONNX models.
- Natalia Vassilieva (Hewlett Packard Enterprise) – HPE Deep Learning Cookbook: How to Choose an Optimal Infrastructure for Deep Learning Workloads
There is a vibrant and fast-growing ecosystem of software and hardware for deep learning. Various deep learning frameworks are available for anyone who wants to try out this technology for their needs. With the variety of choices in hardware configurations and software packages, it is hard to pick the optimal tools, and the effectiveness of any hardware/software environment varies depending on the deep learning workload. The HPE Deep Learning Cookbook is a set of tools that guides the choice of optimal hardware/software configurations for any given deep learning application. It is based on extensive benchmarking of reference workloads and on performance modelling. The talk will present the two main components of the cookbook toolset and general learnings on infrastructure requirements and the performance of deep learning applications.
KEYNOTE - Danny Lange (Unity Technologies) - From Cloud to Cyber-physical Systems, AI, and Beyond
From my first homemade computer to pushing the boundaries of Artificial Intelligence (AI) with the world’s most popular real-time 3D simulation engine, I have been fortunate to always have had the opportunity to work at the very forefront of computing infrastructure and its application. My work on the Unity platform blurs the lines between physical reality and the virtual. Not only does it reach 1.5 billion people using apps based on the Unity engine, but through massive cloud-based simulations we are able to train AI on scenarios that easily exceed humanity’s collective experience in specific areas. Tracing the technologies I have worked on seems to depict a forward-looking trajectory that carries a lot of promise in creating a better world for all, but it also raises responsibilities that we as engineers need to take seriously. I see computing infrastructure rapidly spreading its tentacles deeper into the physical world while concurrently expanding into the virtual as well. The integration of large-scale computational and physical components enables revolutionary new scenarios ranging from self-driving cars to automated farming and healthcare, and even to scientific work itself. Cyber-physical systems with AI that eliminates the human-in-the-loop may soon take care of the world we live in. I see the most intriguing technical challenges ahead, but I also see significant issues in the areas of ethics, security, reliability, and dependability of these systems.
Social & Networking