IEEE Infrastructure 2020 Keynote Speakers
National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab, Division Deputy and Department Head
Katie is the Division Deputy and Data Department Head at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. Katie is also the Project Director of the Perlmutter system, NERSC's next-generation system arriving in 2021 that will support large-scale simulation and data workloads, and was Project Director of the 2016 Cori supercomputer. Katie is Principal Investigator on a research project titled "Science Search: Automated MetaData Using Machine Learning" and is interested in how experimental science facilities can leverage high-performance computing. Before coming to NERSC, Katie worked at the ASC Flash Center at the University of Chicago on the FLASH code, a highly scalable, parallel, adaptive-mesh-refinement astrophysics application. She has an M.S. in Computer Science from the University of Chicago and a bachelor's degree in Physics from Wellesley College.
Introducing the Perlmutter Supercomputer at Lawrence Berkeley National Laboratory
The Perlmutter supercomputer at the National Energy Research Scientific Computing (NERSC) facility will be deployed in early 2021 to support the large-scale simulation, data analysis, and machine learning workloads of researchers funded by the Department of Energy Office of Science. Named after Nobel Laureate Saul Perlmutter, the 100 PF Cray system will include next-generation AMD CPUs, NVIDIA GPU compute nodes, a 30 PB all-flash Lustre file system, and the Ethernet-compatible Cray Slingshot interconnect. This talk will discuss the Perlmutter architecture, applications, and software, and plans to support a growing and diversifying workload that includes AI and large-scale data analysis.
HTC/DeepQ Healthcare, President
Stanford University, Adjunct Professor
Edward Y. Chang has served as the President of AI Research and Healthcare (DeepQ) at HTC since 2012. He also currently serves as an adjunct professor at Stanford University and a technical advisor to SmartNews. His most recent notable work is co-leading the DeepQ project to win the XPRIZE medical IoT contest in 2017, with a 1M USD prize. Prior to his current posts, Ed was a director of Google Research from 2006 to 2012, leading research and development in areas including scalable machine learning, indoor localization, and Google Q&A. His contributions in data-driven machine learning (US patents 8798375 and 9547914) and his ImageNet sponsorship helped fuel the success of AlexNet and the recent resurgence of AI. His open-source code for parallel SVMs, parallel LDA, parallel spectral clustering, and parallel frequent itemset mining (adopted by Berkeley Spark) has been collectively downloaded over 30,000 times. Prior to Google, Ed was a full professor of Electrical & Computer Engineering at the University of California, Santa Barbara. He joined UCSB in 1999 after receiving his PhD from Stanford University. Ed is an IEEE Fellow for his contributions to scalable machine learning.
Scarcity, Diversity, and Privacy of Data in Artificial Intelligence for Precision Medicine
Recent successes of AI in several application domains attest to the importance of big data in achieving high performance on domain-specific metrics. In the healthcare domain, however, big data at the million scale is generally not available. Furthermore, safeguarding data privacy is crucial. This talk first enumerates key data issues in applying AI to the healthcare domain, including data scarcity, out-of-distribution data, and data privacy. We then present remedies: transfer learning for data scarcity, knowledge-guided multimodal learning for out-of-distribution generalization, and distributed ledgers for preserving data privacy.
Hewlett Packard Labs, Fellow and VP
Paolo Faraboschi is a Fellow and VP at Hewlett Packard Labs. His interests lie at the intersection of hardware and software, including HPC, workload-optimized SoCs, and parallel systems. His current research focuses on next-generation memory-driven computing systems, and specifically on the most challenging problems of exascale computing. In the past, he worked on VLIW processors, compilers, and energy-efficient servers. He is an IEEE Fellow for contributions to embedded processor architecture and SoC technology. He has co-authored over 100 publications, 42 patents, and a book. He received his Ph.D. in EE from the University of Genoa, Italy, in 1993.
Infrastructure for Edge-to-Cloud Machine Learning (and vice versa)
The edge-to-cloud infrastructure and machine learning (ML) are becoming increasingly coupled in the real world. In the last decade, large-scale voice and image applications have driven phenomenal breakthroughs in ML algorithms that we all use in our daily interactions with internet service providers. More recently, different industry verticals have started to rapidly adopt similar AI approaches, but their needs extend well beyond standard datacenter applications. These new uses of ML involve complex and noisy multi-sensor data, sparsely labeled ground truth, and complex deployment environments that span from the edge to the cloud. This talk covers some aspects of the intricate relationship between ML and the edge-to-cloud world. This relationship goes both ways, and the talk presents two illustrative examples. In one direction, we need new infrastructure for optimized ML. So, the first example discusses “infrastructure for AI/ML”: an architecture blueprint for ML in a Highly Autonomous Driving (HAD) application that spans from instrumented test cars (at the edge) to the training core (in the cloud). In the other direction, ML helps to optimize the infrastructure. The second example discusses “AI/ML for infrastructure”: the use of ML for operational intelligence (AI-Ops) to automate monitoring and predictive maintenance in a high-performance datacenter. AI-Ops takes a holistic view that includes both IT telemetry and facility sensors at the edge of the datacenter.
Princeton University, Dean of Engineering and Applied Science
Andrea Goldsmith is the Stephen Harris professor in the School of Engineering and a professor of Electrical Engineering at Stanford University. Her research interests are in information theory, communication theory, and signal processing, and their application to wireless communications, interconnected systems, and neuroscience. She founded and served as Chief Technical Officer of Plume WiFi (formerly Accelera, Inc.) and of Quantenna (QTNA), Inc., and she currently serves on the Board of Directors for Medtronic (MDT) and Crown Castle Inc. (CCI). Dr. Goldsmith is a member of the National Academy of Engineering and the American Academy of Arts and Sciences, a Fellow of the IEEE and of Stanford, and has received several awards for her work, including the IEEE Sumner Technical Field Award, the ACM Athena Lecturer Award, the ComSoc Armstrong Technical Achievement Award, the WICE Mentoring Award, and the Silicon Valley/San Jose Business Journal’s Women of Influence Award. She is author of the book “Wireless Communications” and co-author of the books “MIMO Wireless Communications” and “Principles of Cognitive Radio,” all published by Cambridge University Press, as well as an inventor on 29 patents. She received the B.S., M.S. and Ph.D. degrees in Electrical Engineering from U.C. Berkeley.
Dr. Goldsmith is currently the founding Chair of the IEEE Board of Directors Committee on Diversity, Inclusion, and Ethics. She served as President of the IEEE Information Theory Society in 2009 and as founding Chair of its student committee. She has also served on the Board of Governors for both the IEEE Information Theory and Communications Societies. At Stanford she has served as Chair of Stanford’s Faculty Senate and for multiple terms as a Senator, and on its Academic Council Advisory Board, Budget Group, Committee on Research, Planning and Policy Board, Commissions on Graduate and on Undergraduate Education, Faculty Women’s Forum Steering Committee, and Task Force on Women and Leadership.
Diversity & Inclusion in Engineering: It's About Success
The engineering profession cannot reach its maximum potential without embracing the diversity of ideas and experiences that come from people of different backgrounds, and without creating an inclusive culture where all people can thrive. This talk discusses why diversity is important in engineering, provides a snapshot of diversity metrics in the profession, and highlights how individuals, universities, companies, and the IEEE are working to improve diversity and inclusion in engineering.
Stanford University, Associate Professor of Computer Science
Silvio Savarese is an Associate Professor of Computer Science at Stanford University and has been Chief Scientist at AiBee Inc. since 2018. He earned his Ph.D. in Electrical Engineering from the California Institute of Technology in 2005 and was a Beckman Institute Fellow at the University of Illinois at Urbana-Champaign from 2005–2008. He joined Stanford in 2013 after serving as Assistant and then Associate Professor of Electrical and Computer Engineering at the University of Michigan, Ann Arbor, from 2008 to 2013. From 2016 to 2018, he served as a director of the SAIL-Toyota Center for AI Research at Stanford. His research interests include computer vision, robotic perception, and machine learning. He is the recipient of several awards, including Best Paper Awards at ICRA 2019 and CVPR 2018, a Best Student Paper Award at CVPR 2016, the James R. Croes Medal in 2013, a TRW Automotive Endowed Research Award in 2012, an NSF CAREER Award in 2011, and a Google Research Award in 2010. In 2002 he was awarded the Walker von Brimer Award for outstanding research initiative.
Towards the AI-Driven Revolution: Benefits and Risks
We are at the beginning of a new technological transformation called the AI revolution. It is characterized by speed (driven by exponentially faster computing and communication, e.g. 5G), scale (driven by globalization), big data (driven by unprecedented storage capabilities and distributed compute) and, critically, by platforms that go from being purely digital to being intimately connected to the physical world (the Internet of Things, robotics, etc.). These ingredients are creating the perfect conditions for a new generation of AI-based computing tools that generate predictions and guide humans in making critical decisions. Unlike other technologies, AI is a global phenomenon with a low barrier to entry but profound transformative effects on entire industries such as transportation, manufacturing, agriculture, construction, retail, and healthcare, to cite a few. We argue that despite these glowing advances, AI-driven technology is still far from achieving the accuracy and reliability needed for many critical applications; more research and advancement are needed to enable machines that perform on par with humans on many high-level tasks. In this talk I will examine recent progress in robotics and machine vision toward overcoming some of these limitations and opening the door to a new generation of AI systems. I will conclude with a brief overview of the benefits and risks of the AI revolution and the opportunity to foster a dialogue on possible recommendations for responsible leadership on this matter.
Airbnb, Technical Fellow
Raymie Stata is a Technical Fellow at Airbnb, where he focuses on issues related to data management. Prior to Airbnb he was founder and CEO of Altiscale, which offered Spark and Hadoop as a managed service in the cloud. Raymie founded Altiscale after leaving Yahoo!, where he was Chief Technical Officer. At Yahoo!, he played an instrumental role in algorithmic search, display advertising, and cloud computing. He also helped set Yahoo’s Open Source strategy and initiated its participation in the Apache Hadoop project. Prior to joining Yahoo!, Raymie founded Stata Laboratories, maker of the Bloomba search-based e-mail client, which Yahoo! acquired in 2004. He has also worked for Digital Equipment’s Systems Research Center, where he contributed to the AltaVista search engine. Raymie received his PhD in Computer Science from MIT in 1996.
Agility with Stability
A fundamental challenge in running software services is managing the conflict between agility and stability. On the one hand, market forces create unending pressure to add features fast. On the other, the faster you try to run, the more likely you are to break something. In this talk, we contend that the "trade-off curve" between agility and stability is fixed for a given organization. We also describe the forces that cause this curve to degrade over time - including the underappreciated challenges of increasing the size of the team. We conclude by discussing investments that can improve the shape of the curve for your organization.
Micron Technology, Inc., Vice President
With a passion for a great technology story and a love for all things nerdy, Martina helps demystify the tech behind the hype by talking with experts and business leaders driving the trends. Martina is an award-winning product marketing and communications executive with nearly 20 years of R&D, enterprise, start-up, and higher education experience. She has led global teams at Micron and Hewlett Packard Enterprise, responsible for telling an amazing innovation story to customers, partners, employees, media and the world.
She has also hosted Micron’s “Pulling Together” and “Heart of Micron” video/podcast series and was the host and executive producer of The Element podcast while at HPE. Martina was a founding member of a privately-held start-up in Munich, Germany – Symplon AG – specializing in early Tablet PC hardware and mobile computing solutions and consulting.
Martina earned a bachelor’s degree in economics from the Wharton School at the University of Pennsylvania and a master’s degree in digital business management from HEC Paris and Télécom ParisTech. She is author of over 20 peer-reviewed articles and papers and has been a frequent speaker on emerging technology trends such as AI, blockchain, cloud, Computing Beyond Moore’s law, open innovation and economic development. She served as Patronage Chair for the IEEE International Conference on Rebooting Computing (ICRC 2019), and has previously served as Chair of the Advisory Board of the NSF-funded Caribbean Computing Center for Excellence.
Trust During Turbulent Times: How to Use Strategic Communications and Radical Transparency to Engage Your Employees and Transform Your Organization
During challenging times, leaders must step up to communicate with clarity, transparency and credibility. Today, companies have a unique opportunity to act as a trusted, expert voice to their employees, customers, partners and communities. Join this talk to hear more about lessons learned from the front lines of communications during COVID-19 and how you can put these marketing and communication strategies to work to drive engagement and trust with your team members and colleagues, and ultimately to deliver better business and technical outcomes.
Salesforce, Software Architect
Manoj Agarwal is a Software Architect in the Einstein Platform team at Salesforce. He has almost 25 years of experience in the industry, building distributed systems, public cloud services and machine learning platforms.
Serving Very Large Numbers of Low Latency AutoML Models
ML serving infrastructure is becoming ubiquitous in the emerging ML industry as well as in public cloud offerings. Existing solutions overwhelmingly rely on serving models as containers, where one container hosts a single model with all its required dependencies. Salesforce's Einstein Platform takes a unique multi-tenancy approach that relies heavily on AutoML, automating feature engineering, training, and serving of a separate model per tenant, ultimately resulting in serving up to hundreds of thousands of models. In addition, model size, initialization time, and popularity/volume can vary widely based on the underlying customer base of each tenant, introducing what we call the model balancing problem. We present our approach to scaling to a very large number of models using multi-level routing and load balancing, sharing hundreds of models within each container, and using sophisticated metric-driven mechanisms for model initialization, warmup, and model balancing. In addition, we'll present our solution for managing model versions and dependencies in shared-container scenarios and, finally, lessons learned on our journey in this nascent space.
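The routing-and-balancing idea described above can be illustrated with a toy sketch. This is not Salesforce's actual implementation; the class names, tenant IDs, and model IDs are all hypothetical. It shows a fixed pool of containers, each holding an LRU-bounded set of models, with a router hashing each tenant's model to a container shard:

```python
from collections import OrderedDict

class ModelContainer:
    """Hosts up to `capacity` models; evicts the least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.models = OrderedDict()  # model_id -> loaded model (stub)

    def predict(self, model_id, features):
        if model_id not in self.models:
            if len(self.models) >= self.capacity:
                self.models.popitem(last=False)   # evict the coldest model
            self.models[model_id] = f"loaded:{model_id}"  # stand-in for init/warmup
        self.models.move_to_end(model_id)          # mark as recently used
        return {"model": model_id, "score": 0.5}   # dummy score

class Router:
    """First routing level: map a tenant's model to a container shard."""
    def __init__(self, n_containers, capacity_per_container):
        self.containers = [ModelContainer(capacity_per_container)
                           for _ in range(n_containers)]

    def predict(self, tenant_id, model_id, features):
        shard = hash((tenant_id, model_id)) % len(self.containers)
        return self.containers[shard].predict(model_id, features)

router = Router(n_containers=4, capacity_per_container=100)
result = router.predict("tenant-42", "churn-model-v3", {"f1": 1.0})
print(result["model"])  # churn-model-v3
```

A real system would layer metric-driven placement on top of the hash (moving hot or heavy models between shards), which is where the model balancing problem the abstract names comes in.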
Rockset, CTO and Co-Founder
Dhruba Borthakur is cofounder and CTO at Rockset (http://rockset.com), a company building a realtime cloud database for data-powered applications. Previously, Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop file system at Yahoo; an early contributor to the open source Apache HBase project; a senior engineer at Veritas, where he was responsible for the development of VxFS and Veritas SanPointDirect storage system; the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and a senior engineer at IBM-Transarc Labs, where he contributed to the development of Andrew File System (AFS). Dhruba did his graduate studies at the University of Wisconsin, Madison.
The Changing Face of Data Analytics from Batch Analytics to Analytics-on-the-fly
Data analytics started with Hadoop in 2005, which made it possible to mine large data sets and extract intelligence from them. These were batch jobs that could run for hours. A natural evolution followed around 2010, when Apache Spark and Kafka enabled stream processing of data at scale. The stream-processing movement reduced the time from data to insights to tens of minutes. But some industries demanded lower query latencies along with lower data latencies; for example, the Facebook News Feed and the LinkedIn FollowFeed applications required super-low data latency. It is 2018: enter Analytics-on-the-fly!
In this talk, Dhruba walks you through the history and evolution of the original Hadoop architecture that separates compute from storage. He inspects the Lambda Architecture and the reasons it is popular for stream-processing systems. He then describes the Aggregator-Leaf-Tailer (ALT) architecture, which provides Analytics-on-the-fly by allowing fast queries on semi-structured data. He peels apart the disaggregated, cloud-friendly nature of ALT, which lets one scale compute, storage, data rates, and query volumes independently. He describes how replacing the MapReduce framework of earlier-generation Hadoop with a RocksDB-based indexing framework in the ALT architecture reduces query latency. He talks about the CQRS pattern in this architecture, which isolates query latencies from bursty streams. He elaborates on why these new applications demand higher query concurrency and how the ALT architecture provides it. The talk concludes with a description of how the ALT architecture is already in production in a set of applications at Facebook, LinkedIn, and Rockset.
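The separation the ALT/CQRS pattern draws between the write path (tailers building an index from a stream) and the read path (leaves answering queries from that index) can be sketched in a few lines. This is an in-memory toy, not Rockset's implementation; the field names and IDs are made up:

```python
class Tailer:
    """Write path (CQRS command side): tails an event stream and updates the index."""
    def __init__(self, index):
        self.index = index

    def ingest(self, event):
        # Index every field of the semi-structured document for fast lookups.
        doc_id = event["id"]
        for field, value in event.items():
            self.index.setdefault((field, value), set()).add(doc_id)

class Leaf:
    """Read path (CQRS query side): serves low-latency queries from the index."""
    def __init__(self, index):
        self.index = index

    def query(self, field, value):
        return sorted(self.index.get((field, value), set()))

index = {}                       # shared storage layer (disaggregated in practice)
tailer, leaf = Tailer(index), Leaf(index)
tailer.ingest({"id": "u1", "country": "US", "plan": "pro"})
tailer.ingest({"id": "u2", "country": "US", "plan": "free"})
print(leaf.query("country", "US"))  # ['u1', 'u2']
```

Because tailers and leaves are separate components sharing only the index, each side can be scaled independently, which is the property the talk attributes to the disaggregated ALT design.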
Shutterstock, Software Development Manager
Anusha Dayananda is a Software Development Manager at Shutterstock. In this role, she is responsible for the Content Pipeline and the Localization/Internationalization Engineering teams. These teams support Shutterstock’s core content ingestion to publishing lifecycle at scale, and support the site’s offering in 21 languages. Anusha leads a team of engineers to successfully solve complex and challenging problems for Shutterstock’s 1 million Contributors and 1.9 million Customers.
Anusha has over 10 years of experience in Software Engineering across different domains. She previously spent more than seven years working on a software development team in the healthcare technology industry, where she worked to simplify healthcare data collection and reporting for hospitals and government organizations across the country. Anusha holds an MS in Computer Science from New Jersey Institute of Technology.
How Shutterstock Enhanced Stability and Reliability for Message-Driven Applications
With a growing library of over 310 million images and 17 million videos, Shutterstock needs to quickly and efficiently ingest, review, and publish over 200,000 assets per day to enable its customers around the world to deliver impactful stories. Using its own innovative practices, Shutterstock developed robust internal libraries that use RabbitMQ message brokers to support multiple applications. Shutterstock's Software Development Manager, Anusha Dayananda, will share her observations and lessons learned on ensuring stability and reliability for distributed systems. The talk will cover best practices for eliminating central control so that engineering teams can easily adopt and operate the messaging infrastructure, including:
- Best practices for eliminating risks against cluster failures
- How to reduce cross-team dependency
- Ways to optimize engineering workflows for time savings and efficiency
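One practice that underpins reliability work like the items above is making consumers idempotent, since a broker such as RabbitMQ guarantees at-least-once delivery and can redeliver a message after a broker or consumer failure. A minimal, broker-free sketch of the idea (the handler, message IDs, and bodies are hypothetical, not Shutterstock's code):

```python
class IdempotentConsumer:
    """Deduplicates redelivered messages by message ID before handling them."""
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # in production: a persistent store, ideally with a TTL

    def on_message(self, message):
        msg_id = message["id"]
        if msg_id in self.seen:
            return "skipped"            # redelivery: already processed
        self.handler(message["body"])
        self.seen.add(msg_id)           # mark done only after the handler succeeds
        return "processed"

published = []
consumer = IdempotentConsumer(handler=published.append)
consumer.on_message({"id": "m1", "body": "ingest asset 123"})
consumer.on_message({"id": "m1", "body": "ingest asset 123"})  # broker redelivers
print(published)  # ['ingest asset 123'] -- handled exactly once
```

With this pattern, a cluster failover that causes redelivery does not cause duplicate side effects, which is one ingredient of the cluster-failure resilience the first bullet refers to.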
Google, Software Engineer
Nova is a software engineer working on Federated Learning platform infrastructure at Google, building services that carry out privacy preserving machine learning and analytics algorithms at scale. Prior to working on Federated Learning she worked on the Google Cloud BigQuery data warehouse. She graduated from the University of Pennsylvania in 2017 with a B.S.E. and M.S.E. in Computer Science. At Penn she was a member of the IEEE Eta Kappa Nu honor society.
Privacy-Preserving Machine Learning
Federated Learning enables mobile devices to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. Topics to be presented include: (1) how federated learning differs from more traditional machine learning paradigms; (2) practical algorithms for federated learning that address the unique challenges of this setting; (3) extensions to federated learning, including differential privacy, secure aggregation, and compression for model updates; and (4) a quick overview of federated learning applications and systems at Google.
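At the heart of topic (2) is usually Federated Averaging (FedAvg): each device trains locally on its own data, and the server averages the resulting model weights, weighted by each client's number of examples. A minimal pure-Python sketch of the server-side averaging step (the client counts and weights below are toy numbers, not a real model):

```python
def federated_average(client_updates):
    """Weighted average of client model weights (the FedAvg server step).

    client_updates: list of (num_examples, weights) pairs, where `weights`
    is a flat list of model parameters trained on-device.
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    averaged = [0.0] * dim
    for n, weights in client_updates:
        for i, w in enumerate(weights):
            averaged[i] += (n / total_examples) * w
    return averaged

# Two devices: one trained on 100 examples, the other on 300.
updates = [(100, [1.0, 0.0]), (300, [0.0, 1.0])]
print(federated_average(updates))  # [0.25, 0.75]
```

The extensions listed in topic (3) wrap exactly this step: secure aggregation lets the server compute the sum without seeing individual updates, and differential privacy adds calibrated noise to it.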
Walmart Labs, Principal Machine Learning Engineer
As part of the first merchant technology data science team at Walmart Labs, Hamza has worked to help create a data-informed culture at the company. During his tenure, he led the first-ever team to make assortment, pricing, and replenishment recommendations in stores, eventually owning the P&L for them. Currently, he leads multiple data science teams responsible for building machine learning and optimization algorithms to help merchants make better decisions.
Going Beyond Big Data: Taking ML to the Next Level
All retailers want to know their target buyers better. However, understanding the past and present of those interactions simply isn't enough these days, and predictive analytics is the next step to understanding customers better. In this session, topics of discussion will include, but not be limited to:
- How ML can enable price optimization, product placement and assortment selection
- Using machine learning algorithms effectively to generate suggestions for substitute and complementary items
- Using deep learning and reinforcement learning to improve order forecasting
- Utilizing optimization algorithms to reduce store costs by optimizing replenishment cycle and safety stock
- Scaling algorithms to generate recommendations for individual stores and to monitor their performance
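The replenishment-and-safety-stock bullet above typically comes down to a classic inventory formula: safety stock absorbs demand variability over the resupply lead time, and the reorder point adds the expected lead-time demand. A sketch with made-up numbers (the z-score, demand, and lead-time figures are illustrative, not Walmart's):

```python
import math

def reorder_point(mean_daily_demand, demand_stddev, lead_time_days, z):
    """Reorder point = expected lead-time demand + safety stock.

    Safety stock = z * sigma_d * sqrt(lead_time), assuming independent
    daily demand; z encodes the target service level (1.65 ~ 95%).
    """
    safety_stock = z * demand_stddev * math.sqrt(lead_time_days)
    return mean_daily_demand * lead_time_days + safety_stock

# A store sells 50 units/day on average (stddev 10); resupply takes 4 days.
rop = reorder_point(mean_daily_demand=50, demand_stddev=10, lead_time_days=4, z=1.65)
print(round(rop))  # 233
```

ML enters by replacing the fixed mean and standard deviation with per-store, per-item demand forecasts, and optimization then tunes z against the cost of stockouts versus holding inventory.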
OpsCruise, Co-Founder
Aloke Guha is a serial entrepreneur with an extensive career in data center technologies including storage and networking, big data, and machine learning and analytics predating the AI winters. Before co-founding OpsCruise, he was Vice President of Analytics and Big Data products at Hitachi Data Systems. His earlier startups include Copan Systems, a pioneer in MAID data storage (acquired by HPE), and Datavail/CreekPath (acquired by HP), as well as a number of machine learning startups in areas of real-time intent detection and contextual search and analytics. Previously, he was VP and Chief Architect at StorageTek (acquired by Oracle) and CTO of Network Systems (acquired by StorageTek).
Aloke is an IEEE Senior Member and has authored over 60 patents (26 issued) and over 60 technical publications.
He holds a B. Tech (EE) from the Indian Institute of Technology (Kanpur), and an MSEE and Ph.D. from the University of Minnesota.
Model Based Control for Microservices Applications
The move to the cloud to leverage agility and scale initiated a fundamental shift from monolithic to microservices architectures. While this shift improved the agility of application development (Dev) teams, it created significant challenges for operations (Ops) teams. These challenges result from variability in application structure and behavior, shifting bottlenecks, and limited visibility and control of cloud infrastructure resources and services. Traditional performance management approaches, such as queuing networks or heuristics that capture static relationships between performance and resources, are no longer applicable in highly dynamic virtual cloud environments.
We have designed and implemented a new model-based control platform for managing microservices application performance. Relying on existing monitoring, event, and configuration data, especially data collected from open CNCF projects such as Kubernetes and Prometheus, and without instrumenting application code, we automatically discover and build the application structure. Subsequently, using both machine learning (ML) and a priori knowledge of known services, we build a predictive behavior model for the complete application. The application model and structure provide sufficient granularity to detect and predict performance problems, isolate the cause of incidents, and recommend remedial actions to the infrastructure and services for problem resolution.
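A predictive behavior model in this spirit can be as simple as tracking each service metric's running statistics and flagging deviations before they become incidents. Below is a toy exponentially weighted (EWMA) z-score detector, a generic illustration and not OpsCruise's actual model; the latency numbers are invented:

```python
class MetricModel:
    """Learns a metric's running mean/variance (EWMA) and flags outliers."""
    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha, self.threshold, self.warmup = alpha, threshold, warmup
        self.mean, self.var, self.n = 0.0, 0.0, 0

    def observe(self, value):
        """Returns True if `value` deviates beyond `threshold` sigmas."""
        self.n += 1
        if self.n == 1:                    # first sample initializes the model
            self.mean = value
            return False
        diff = value - self.mean
        anomalous = (self.n > self.warmup and self.var > 0
                     and abs(diff) > self.threshold * self.var ** 0.5)
        # Update the running statistics (exponentially weighted).
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous

model = MetricModel()
latencies = [100, 101, 100, 99, 100, 101, 100, 99, 400]  # ms; last one spikes
flags = [model.observe(x) for x in latencies]
print(flags[-1])  # True -- the 400 ms spike is flagged
```

A platform like the one described would run a model per metric per service, then correlate the resulting anomaly signals against the discovered application structure to isolate the cause.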
Intel, Fellow and Chief Technology Officer of Information Technology
Shesha Krishnapura is an Intel Fellow and chief technology officer in the Information Technology organization at Intel Corporation. He is responsible for advancing Intel data centers for energy and rack space efficiency, disaggregated server innovation and hardware designs, high-performance computing (HPC) for electronic design automation (EDA), and optimized platforms for enterprise computing. He is also responsible for fostering unified technical governance across IT, leading consolidated IT strategic research and pathfinding efforts, and advancing the talent pool within the IT technical community to help shape the future of Intel.
A three-time recipient of the Intel Achievement Award, Shesha was appointed an Intel Fellow in 2016. His external honors include an InformationWeek Elite 100 award, an InfoWorld Green 15 award and recognition by the U.S. Department of Energy for industry leadership in energy efficiency. He has been granted several patents and has published more than 75 technical articles.
Shesha holds a bachelor’s degree in electronics and communications engineering from Bangalore, India, and a master’s degree in computer science from Oregon State University. He is the founding chair of the EDA computing board of advisers for platform standards. He has represented Intel as a voting member of the Open Compute Project incubation committee.
At Scale Green Computing Innovations for Datacenter Transformation
Datacenters are the backbone of internet commerce and cloud storage, and they are also hugely important in supporting the computational and transactional needs of today's corporate world. They currently consume 3% of the world's electrical supply, and the world produces upwards of 50 million metric tons of e-waste annually. Thus, Total Cost to the Environment (TCE) is as important as Total Cost of Ownership (TCO) for datacenters.
Intel has a huge datacenter investment with over 290,000 servers (which include over 2 million Xeon high clock cores), over 384 petabytes of storage, and more than 545,000 network ports within its 86-Megawatt data center capacity. This infrastructure is required to support efforts including complex chip design, and Intel’s overall computing needs have grown in excess of 6000% over the past 13 years.
This talk will discuss the complexities of building and maintaining these datacenters, as well as how green computing initiatives have delivered substantial reductions in power costs while significantly reducing greenhouse gas emissions. Being green is not just about running datacenters at the most efficient Power Usage Effectiveness (PUE) levels, but also about reducing e-waste.
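For reference, PUE is simply the ratio of total facility energy to the energy delivered to IT equipment, so a quick worked example (the numbers are illustrative, not Intel's) shows why driving it toward 1.0 matters:

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: 1.0 means all power reaches IT gear."""
    return total_facility_kw / it_equipment_kw

# Illustrative: a 10 MW IT load in a typical facility vs. an efficient one.
typical = pue(total_facility_kw=18_000, it_equipment_kw=10_000)    # 1.8
efficient = pue(total_facility_kw=10_600, it_equipment_kw=10_000)  # 1.06
overhead_saved_kw = (typical - efficient) * 10_000
print(typical, efficient, round(overhead_saved_kw))  # 1.8 1.06 7400
```

At the scale of an 86-megawatt fleet, shaving that kind of cooling and power-distribution overhead is where the bulk of the power-cost and emissions savings come from.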
Datakin, CTO and Co-founder
Julien Le Dem
Julien Le Dem is the CTO and Co-Founder of Datakin. He co-created Apache Parquet and is involved in several open source projects including Marquez (LF AI), Apache Pig, Apache Arrow, Apache Iceberg, and a few others. Previously, he was a senior principal at WeWork; principal architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.
Data Platform Architecture Principles
We’re well into the Big Data era. Most organizations have embraced collecting data and analyzing what’s happening inside their products. This is crucial to their success, not only to understand what works but also to optimize their services and increase their value to their customers. Several industries are being disrupted just by using technology to optimize existing processes. Think about cabs, short-term rentals, or co-working spaces.
There’s also discussion around what constitutes “big” data. Here, we’re not only talking about the large volumes of data produced by the likes of Google, Facebook, and other very large companies. We’re also talking about the multitude of data sources and the many teams using them and producing derived datasets. The concept of a central data team that does all the data-related work is outdated; the entire organization should become an ecosystem where teams depend on each other. Central data teams now become enablers, coaching and providing a safe and flexible environment to move fast while bringing transparency to the increasing complexity of interdependent systems. Data processing and microservices have similar requirements in terms of ownership, monitoring, and dependency management.
In this talk we will discuss the principles to follow while building the data platform enabling the entire organization to build data driven products, whether using insights from the data or using data directly to build features (for example recommendations).
Every team can consume and produce data using explicit contracts: what they share (or don’t), the level of service they provide, and the quality of the data. We need to build visibility across the entire org and help evolve the dependency graph with global lineage and schema evolution.
The platform is self-service and gets out of the way to empower users to do the right thing. It provides a safe environment where mistakes can be easily mitigated and the scope of their impact limited. It is flexible enough to let users pick the best tool for the job while facilitating interdependencies. Streaming and batch processing are complementary and work together. Governance is delegated to the appropriate stewards. Sensitive data is properly annotated and secured, and its usage tracked and controlled. With the cloud omnipresent, users expect not to have to worry about where processes run or where data is stored. The platform is expected to scale transparently and be billed by the minute.
We’ll discuss the best tools for building the data platform and how to build the missing pieces.
Netflix, Senior Software Engineer
Joseph Lynch is a Senior Software Engineer at Netflix who focuses on building high volume datastore infrastructure for providing low latency data access. He is a core contributor to Netflix’s realtime datastore platform, which supports their always-available polyglot persistence tier including Cassandra, Elasticsearch, CockroachDB, Zookeeper, and more. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Prior to Netflix, he helped Yelp scale distributed databases and was a key engineer in the design and implementation of their service mesh architecture. Joseph graduated from the Massachusetts Institute of Technology in 2013 with a SB in Electrical Engineering and Computer Science.
Towards Practical Self-Healing Distributed Databases
As distributed databases expand in popularity, there is ever-growing research into new database architectures that are designed from the start with built-in self-tuning and self-healing features. In real-world deployments, however, migration to these entirely new systems is impractical, and the challenge is to keep massive fleets of existing databases available under constant software and hardware change. Apache Cassandra is one such existing database that helped to popularize “scale-out” distributed databases, and it runs some of the largest existing deployments of any open-source distributed database. In this paper, we demonstrate the techniques needed to transform the typical, highly manual Apache Cassandra deployment into a self-healing system. We start by composing specialized agents together to surface the signals needed for a self-healing deployment and to execute local actions. Then we show how to combine the signals from the agents into the cluster-level control planes required to safely iterate and evolve existing deployments without compromising database availability. Finally, we show how to create simulated models of the database’s behavior, allowing rapid iteration with minimal risk. With these systems in place, it is possible to create a truly self-healing database system within existing large-scale Apache Cassandra deployments.
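As an illustration of the agent/control-plane split described above — a minimal sketch with invented signal and action names, not the paper’s actual design — on-host agents surface simple health signals, and a cluster-level control plane turns them into a bounded set of safe actions so the database stays available while the fleet converges.

```python
from dataclasses import dataclass

# Hypothetical signal an on-host agent might surface for one node.
@dataclass
class NodeSignal:
    node: str
    disk_pct: float       # disk utilization, 0.0 - 1.0
    gossip_healthy: bool  # node participating in cluster gossip

def plan_actions(signals, max_concurrent=1):
    """Cluster-level control plane: turn raw agent signals into actions,
    remediating at most `max_concurrent` nodes at a time so availability
    is never compromised by the control plane itself."""
    actions = []
    for s in signals:
        if not s.gossip_healthy:
            actions.append((s.node, "restart"))
        elif s.disk_pct > 0.9:
            actions.append((s.node, "compact"))
        if len(actions) >= max_concurrent:
            break  # respect the availability budget; revisit next cycle
    return actions
```

The key design point is that agents only observe and act locally, while the decision of *how many* nodes to touch at once lives in one place.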
Hewlett Packard Enterprise, NFV R&D
Chengappa M. R.
Chengappa has been with HP/HPE for a decade and is currently part of the Compute Business Group, catering to the requirements and solutions for telcos on NFV infrastructure. From his early career as an intern software engineer, his focus areas have moved him into defining, building, and delivering NFV-based telco solutions, bringing together technologies that integrate cloud-native applications/VNFs on NFVi platforms. As an IEEE Senior Member, he also plays a key role in IEEE Bangalore Section initiatives around industry engagement and professional activities.
Open Distributed Infrastructure Management - ODIM
The evolution of the telecom network to 5G and edge computing requires IT compute, storage, and networking infrastructure from multiple vendors to be deployed across potentially thousands of geographically distributed and diverse points of presence, including central offices, cell tower huts, and wiring closets, as well as traditional data centers. This sharply contrasts with the comparatively few homogeneous hyper-scale data centers of the cloud service providers. Telecom operators are demanding open, scalable infrastructure management for this distributed multi-vendor infrastructure, based on widely adopted industry standards. In this paper, we address these unique challenges through an open-source, Distributed Management Task Force (DMTF) Redfish®-based Resource Aggregation function; building a vibrant open-source infrastructure management community around the Resource Aggregator will enable the construction of next-generation telecommunications and other highly distributed networks.
Netflix, Senior Software Engineer
Cody Rioux is a Software Engineer designing and developing real-time observability platforms at Netflix. His work ranges from outlier/anomaly detection to control theory based autoscalers to large scale distributed systems. He is currently a core committer to Mantis, an open source real-time observability platform which plays a critical role in keeping the Netflix cloud environment reliable. His work has been featured on the Netflix Tech blog, as well as at several meetups, Strata+Hadoop World and PyData.
Mantis Query Language: On-Demand Real-Time Intro
Mantis, together with the Mantis Query Language (MQL), is an open-source platform which provides Netflix service owners with infrastructure and a DSL for instantly accessing real-time insights across the entire ecosystem. Currently Mantis moves over 2 trillion events per day, encompassing several petabytes of data. We'll explore the architecture of Mantis and how it enables low-cost real-time data, as well as the design of the query language, which provides an expressive abstraction over all of this machinery. Ultimately we'll see how Mantis enables use cases from outlier/anomaly detection through A/B testing and much more for a reliable cloud experience, and how you can take advantage of the open source version to observe and monitor your own cloud environment.
Mentor, a Siemens Business
Christopher Wolff earned his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University. He has worked at many IC and EDA companies and currently leads an infrastructure team for the Catapult product. He is a Senior IEEE member.
Shared File Updating for Test Farms Using a File Cache
When a file needs to be updated by thousands of tests running on a compute farm, waiting for a chance to update slows down the overall runtime as the number of parallel test runs increases. We implemented a file cache that allows each test to write a small temporary file that either updates the shared file or is stored for later updating. A network file lock controls access to the shared file. If the file lock is not acquired, temporary files are merged to reduce the number of future file reads. Temporary files are removed only once their contents are successfully added to the shared file or merged into another temporary file. Each test needs to update the shared file many times during a run, so a version number is used to select the temporary file to read and discard the others. The file cache eliminated the time delay for updating the shared file, saving hours of runtime per test suite and speeding up developer check-in time with no penalty for increasing tests or compute nodes. This paper describes how the file cache was developed, details the current approach, and illustrates the performance benefits.
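The merge protocol described above can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' implementation: an atomic file create stands in for the network file lock, pending temporary files are merged into the shared file only when the lock is acquired, and a temporary file is removed only after its contents have been merged. (The version-number selection from the paper is not modeled here.)

```python
import glob
import json
import os
import uuid

LOCK = "shared.lock"    # stand-in for the network file lock
SHARED = "shared.json"  # hypothetical shared results file
TMPDIR = "pending"      # directory holding deferred updates

def try_lock():
    # O_CREAT | O_EXCL makes lock-file creation atomic: exactly one
    # test at a time can succeed, mimicking a network file lock.
    try:
        os.close(os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        return True
    except FileExistsError:
        return False

def record_update(update):
    """Each test writes its update to a small temporary file, then
    merges all pending updates only if it wins the lock."""
    os.makedirs(TMPDIR, exist_ok=True)
    tmp = os.path.join(TMPDIR, f"{uuid.uuid4().hex}.json")
    with open(tmp, "w") as f:
        json.dump(update, f)
    if try_lock():
        try:
            flush_pending()
        finally:
            os.remove(LOCK)
    # else: the temp file simply waits for a later lock holder

def flush_pending():
    merged = {}
    if os.path.exists(SHARED):
        with open(SHARED) as f:
            merged = json.load(f)
    for path in glob.glob(os.path.join(TMPDIR, "*.json")):
        with open(path) as f:
            merged.update(json.load(f))
        os.remove(path)  # removed only after its contents are merged
    with open(SHARED, "w") as f:
        json.dump(merged, f)
```

Because a test that fails to acquire the lock still persists its update as a temporary file, no test ever blocks waiting for the shared file, which is the source of the runtime savings claimed above.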
Twitter, Sr. Staff Software Engineer
Lohit is part of the Data Platform team at Twitter. He concentrates on projects around storage, compute, and log pipelines at Twitter scale, both on premises and in the cloud. He worked at several startups before joining Twitter. He has a Master's degree in Computer Science from Stony Brook University.
Lessons from Scaling HDFS for Exabyte Storage at Twitter
In the past decade the Hadoop Distributed File System (HDFS) has become the de facto storage choice for large-scale analytics. Growing data storage needs at Twitter have forced us to solve multiple scalability, reliability, hardware layout, and performance problems in HDFS. At Twitter we have scaled HDFS storage across multiple clusters by incrementally consuming new features and contributing scalability fixes to the open-source community. In this presentation we lay out our architecture for scaling HDFS to exabyte storage, share our learnings from managing such a large-scale distributed system, and note a few changes planned for the future. Participants will get an introduction to the challenges of managing a large-scale distributed storage system. They will learn about Twitter's data architecture and its journey in scaling an open-source system.
Twitter, Staff Software Engineer
Zhenzhao works at Twitter as part of the Data Platform organization. He currently concentrates on Twitter's log ingestion pipeline, which scales to handle trillions of events per day. Previously he was a member of the DFS (Pangu) team in Alibaba Cloud, where he focused on random-file-access features in Pangu, used as storage for virtual machines.
Scaling Event Aggregation at Twitter to Handle Billions of Events per Minute
Log files consisting of events from different services are a rich source of information for large-scale analytics. Events can be as simple as a log line or as complex as nested structured objects such as Thrift or Protocol Buffers messages. At Twitter every service logs events for a particular category and publishes them to the Event Log Aggregation framework. This framework aggregates events of the same category into log files, usually stored on a distributed file system such as the Hadoop Distributed File System (HDFS). Large-scale, multi-petabyte analytics use these files across hundreds of projects. In this paper we provide an overview of the Event Aggregation framework used at Twitter, highlight its advantages, and compare it with similar frameworks. We also introduce the concepts of category group and aggregator group in our architecture. Services at Twitter generate trillions of events with an aggregate size exceeding multiple petabytes of data every day. At present this framework handles over three billion events per minute. The main focus of our efforts has been efficient use of hardware resources and the scalability and reliability of the framework.
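One way to picture the category/aggregator-group split mentioned above — a loose sketch only, since the abstract does not describe the framework's actual partitioning scheme — is that each category is deterministically assigned to an aggregator group, and events are bucketed first by group, then by category, so the same group always owns a category's log files.

```python
import zlib
from collections import defaultdict

NUM_AGGREGATOR_GROUPS = 4  # hypothetical; real sizing would differ

def aggregator_group(category):
    # Deterministic assignment: the same category always lands in the
    # same aggregator group, keeping file ownership stable.
    return zlib.crc32(category.encode()) % NUM_AGGREGATOR_GROUPS

def aggregate(events):
    """Bucket (category, payload) events by aggregator group, then by
    category - the shape an aggregation tier might write out as one
    log file per category on HDFS."""
    groups = defaultdict(lambda: defaultdict(list))
    for category, payload in events:
        groups[aggregator_group(category)][category].append(payload)
    return groups
```

Grouping categories lets the framework scale out by adding groups without reshuffling every category, at the cost of some imbalance between groups.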
Microsoft, Senior Program Manager
Faith Xu is a Senior Program Manager at Microsoft on the Machine Learning Platform team, focusing on frameworks and tools. She leads efforts to support performant and scalable accelerated ML model inferencing across a variety of high-volume products. She is an evangelist for adoption of the open source ONNX standard with community partners to promote an open ecosystem in AI.
Faster Scalable ML Model Deployment Using ONNX and Open Source Tools
As ML developments shift from research to the real world, we encounter many deployment challenges. Teams may be experimenting with various training frameworks, with deployments targeting multiple platforms and hardware. While training using one framework with one hardware target can easily be managed, it becomes challenging with a matrix of multiple frameworks and deployment targets. This fragmented ecosystem introduces deployment complexities, and oftentimes custom code is needed to maximize performance for each scenario, which is time-consuming to maintain as models are updated.
To streamline this, the interoperable ONNX model format and the ONNX Runtime inference engine can be used to deploy models performantly across a variety of hardware. Models trained in PyTorch, TensorFlow, scikit-learn, Core ML, and more can all be converted to the common ONNX format, and the model can then be run with the cross-platform, performance-focused ONNX Runtime inference engine, which supports various hardware acceleration options across CPUs and GPUs.
ONNX Runtime is already used in key Microsoft services, on average realizing 2x performance improvements. In this session, we'll share an overview of ONNX Runtime, success stories and usage examples from high volume product groups at Microsoft, and demonstrate ways to integrate this into your AI workflows for immediate impact.