Lenovo Big Data Validated Design for Cloudera Enterprise with Local and Decoupled SAS Storage

Last update: 24 October 2018
Version 1.3
Configuration Reference Number: BDCLDRXX83

Reference architecture for Cloudera Enterprise with Apache Hadoop and Apache Spark
Solution based on the ThinkSystem SR650 server, bare-metal and virtualized
Solution based on ThinkSystem SD530 compute node with D3284 SAS storage expansion enclosure
Deployment considerations for scalable racks including detailed validated bills of material

Dan Kangas (Lenovo)
Weixu Yang (Lenovo)
Ajay Dholakia (Lenovo)
Dwai Lahiri (Cloudera)

Table of Contents

1 Introduction
2 Business problem and business value
3 Requirements
    Functional Requirements
    Non-functional Requirements
4 Architectural Overview
    Cloudera Enterprise
    Bare-metal Cluster - Local and External SAS Storage (JBOD)
    Virtualized Cluster with VMware vSphere
5 Component Model
    Cloudera Components
    Apache Spark on Cloudera
6 Operational Model
    Hardware Description
        6.1.1 Lenovo ThinkSystem SR650 Server
        6.1.2 Lenovo ThinkSystem SR630 Server
        6.1.3 Lenovo ThinkSystem SD530 Compute Server
        6.1.4 Lenovo RackSwitch G8052
        6.1.5 Lenovo RackSwitch G8272
        6.1.6 Lenovo RackSwitch NE2572
        6.1.7 Lenovo RackSwitch NE10032
        6.1.8 Lenovo D3284 SAS Expansion Enclosure
    Cluster Node Configurations
        6.2.1 Worker Nodes
        6.2.2 Master and Utility Nodes
        6.2.3 System Management and Edge Nodes
        6.2.4 External SAS Storage Node
    Cluster Software Stack
        6.3.1 Cloudera Enterprise CDH
        6.3.2 Red Hat Operating System
    Cloudera Service Role Layouts
    System Management
    Networking
        6.6.1 Data Network
        6.6.2 Hardware Management Network
        6.6.3 Multi-rack Network
        6.6.4 10Gb and 25Gb Data Network Configurations
    Predefined Cluster Configurations
        6.7.1 SR650 Configurations
        6.7.2 SD530 with D3284 Configurations
        6.7.3 Cluster Storage Capacity
        6.7.4 Storage Tiering with NVMe and SSD Drives
        6.7.5 D3284 Storage Tiering
        6.7.6 SD530 and D3284 Configuration Options
7 Deployment considerations
    Increasing Cluster Performance
    Processor Selection
        7.2.1 SR630/SR650 Processors
        7.2.2 SD530 Processors
    Designing for Storage Capacity and Performance
        7.3.1 Node Capacity
        7.3.2 Node Throughput
        7.3.3 HDD Controller
    Memory Size and Performance
    Data Network Considerations
    Designing with Hadoop Virtualized Extensions (HVE)
        7.6.1 Enabling Hadoop Virtualization Extensions (HVE)
    Cloudera VMware Virtualized Configuration
        7.7.1 Cluster Software Stack
        7.7.2 ESXi Hypervisor and Guest OS Configuration
    Estimating Disk Space
    Scaling Considerations
        7.9.1 Scaling D3284 External SAS JBOD Storage
        7.9.2 Scaling D3284 Storage and SD530 Compute Independently
    High Availability Considerations
        7.10.1 Network Availability
        7.10.2 Cluster Node Availability
        7.10.3 Storage Availability
        7.10.4 Software Availability
    Linux OS Configuration Guidelines
        7.11.1 OS configuration for Cloudera CDH
        7.11.2 OS Configuration for SAS Multipath
    Designing for High Ingest Rates
8 Bill of Materials - SR650 Nodes
    Master Node
    Worker Node
    System Management Node
    Management Network Switch
    Data Network Switch
    Rack
    Cables
9 Bill of Materials - SD530 with D3284
    Master Node
    Worker Node
    Systems Management Node
    External SAS Storage Enclosure
    Management Network Switch
    Data Network Switch
    Rack
    Cables
    Software
10 Acknowledgements
11 Resources
Document history

1 Introduction

This document describes the reference architecture for Cloudera Enterprise on bare-metal with locally attached storage and with decoupled compute and storage, and on a virtualized platform with VMware vSphere. It provides a predefined and optimized hardware infrastructure for Cloudera Enterprise, a distribution of Apache Hadoop and Apache Spark with enterprise-ready capabilities from Cloudera. This reference architecture provides the planning, design considerations, and best practices for implementing Cloudera Enterprise with Lenovo products.

Lenovo and Cloudera worked together on this document, and the reference architecture that is described herein was validated by Lenovo and Cloudera.

With the ever-increasing volume, variety and velocity of data becoming available to an enterprise comes the challenge of deriving the most value from it. This task requires the use of suitable data processing and management software running on a tuned hardware platform. With Apache Hadoop and Apache Spark emerging as popular big data storage and processing frameworks, enterprises are building so-called Data Lakes by employing these components.

Cloudera brings the power of Hadoop to the customer's enterprise. Hadoop is an open source software framework that is used to reliably manage large volumes of structured and unstructured data. Cloudera expands and enhances this technology to withstand the demands of your enterprise, adding management, security, governance, and analytics features. The result is that you get a more enterprise-ready solution for complex, large-scale analytics.

VMware vSphere brings virtualization to Hadoop with many benefits that cannot be obtained on physical infrastructure or in the cloud. Virtualization simplifies the management of your big data infrastructure, enables faster time to results and makes it more cost effective. It is a proven software technology that makes it possible to run multiple operating systems and applications on the same server at the same time. Virtualization can increase IT agility, flexibility, and scalability while creating significant cost savings. Workloads get deployed faster, performance and availability increase, and operations become automated, resulting in IT that is simpler to manage and less costly to own and operate.

This reference architecture is intended for IT professionals, technical architects, sales engineers, and consultants, and assists them in planning, designing, and implementing the big data solution with Lenovo hardware. It is assumed that you are familiar with Hadoop components and capabilities. For more information about Hadoop, see the "Resources" section.

2 Business problem and business value

Business Problem

The world is well on its way to generating more than 40 million TB of data by 2020. In all, 90% of the data in the world today was created in the last two years alone. This data comes from everywhere, including sensors that are used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals. This data is big data.

Big data spans the following dimensions:

• Volume: Big data comes in one size: large in size, quantity, and/or scale. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
• Velocity: Often time-sensitive, big data must be used as it is streaming into the enterprise to maximize its value to the business.
• Variety: Big data extends beyond structured data, including unstructured data of all varieties, such as text, audio, video, click streams, and log files.

Enterprises are incorporating large data lakes into their IT architecture to store all their data. The expectation is that ready access to all the available data can lead to higher quality of insights obtained through the use of analytics, which in turn drive better business decisions. A key challenge faced today by these enterprises is setting up an easy-to-deploy data storage and processing infrastructure that can start to deliver the promised value in a very short amount of time. Spending months of time and hiring dozens of skilled engineers to piece together a data management environment is very costly and often leads to frustration from unrealized goals. Furthermore, the data processing infrastructure needs to be easily scalable in addition to achieving desired performance and reliability objectives.

Big data is more than a challenge; it is an opportunity to find insight into new and emerging types of data to make your business more agile. Big data also is an opportunity to answer questions that, in the past, were beyond reach. Until now, there was no effective way to harvest this opportunity. Today, Cloudera uses the latest big data technologies, such as the in-memory processing capabilities of Spark in addition to the standard MapReduce scale-out capabilities of Hadoop, to open the door to a world of possibilities.

Business Value

Hadoop is an open source software framework that is used to reliably manage and analyze large volumes of structured and unstructured data. Cloudera enhances this technology to withstand the demands of your enterprise, adding management, security, governance, and analytics features. The result is that you get an enterprise-ready solution for complex, large-scale analytics.

How can businesses process tremendous amounts of raw data in an efficient and timely manner to gain actionable insights? Cloudera allows organizations to run large-scale, distributed analytics jobs on clusters of cost-effective server hardware. This infrastructure can be used to tackle large data sets by breaking up the data into "chunks" and coordinating data processing across a massively parallel environment. After the raw data is stored across the nodes of a distributed cluster, queries and analysis of the data can be handled efficiently, with dynamic interpretation of the data formatted at read time. The bottom line: Businesses can finally get their arms around massive amounts of untapped data and mine that data for valuable insights in a more efficient, optimized, and scalable way.
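The "chunks" idea described above can be illustrated with a minimal, single-machine sketch. It is not Hadoop or Cloudera code: the function names (`split_into_chunks`, `map_chunk`, `word_count`), the chunk size, and the sample records are all assumptions made for illustration, and a thread pool stands in for the nodes of a cluster. Each chunk is processed independently (the map step) and the partial results are then merged (the reduce step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the input into fixed-size chunks (hypothetical stand-in for data blocks)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def word_count(records, chunk_size=2, workers=4):
    """Process chunks in parallel, then merge the partial counts (reduce step)."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(lambda a, b: a + b, partials, Counter())


# Illustrative sample data, not taken from the reference architecture.
sample = [
    "big data volume",
    "big data velocity",
    "big data variety",
]
totals = word_count(sample)
print(totals["big"], totals["volume"])
```

On a real cluster the chunks would live on different nodes and the framework would schedule the map work next to the data; the sketch only shows why independent chunks make the work easy to parallelize.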

Cloudera that is deployed on Lenovo System x servers with Lenovo networking components provides superior performance, reliability, and scalability. The reference architecture supports entry through high-end configurations and the ability to easily scale as the use of big data grows. A choice of infrastructure components provides flexibility in meeting varying big data analytics requirements.

There is growing interest in deploying Hadoop on a virtualized infrastructure, driven by the promise of easier cluster management during initial deployment as well as when adding more nodes as data storage and processing requirements grow. The ability to have a virtualized Hadoop environment look and feel the same as it does on a bare-metal infrastructure allows flexibility in incorporating the solution within an enterprise's data management architecture.

3 Requirements

The functional and non-functional requirements for this reference architecture are described in this section.

Functional Requirements

A big data solution supports the following key functional requirements:

• Ability to handle various workloads, including batch and real-time analytics
• Industry-standard interfaces so that applications can work with Cloudera
• Ability to handle large volumes of data of various data types
• Various client interfaces

Non-functional Requirements

Customers require their big data solution to be easy, dependable, and fast. The following non-functional requirements are key:

• Easy:
    o Ease of development
    o Easy management at scale
    o Advanced job management
    o Multi-tenancy
    o Easy to access data by various user types
• Dependable:
    o Data protection with snapshot and mirroring
    o Automated self-healing
    o Insight into software/hardware health and issues
    o High availability (HA) and business continuity
• Fast:
    o Superior performance
    o Scalability
• Secure and governed:
    o Strong authentication and authorization
    o Kerberos support
    o Data confidentiality and integrity

4 Architectural Overview

Cloudera Enterprise

Figure 1 shows the main features of the Cloudera reference architecture that uses Lenovo hardware. Users can log into the Cloudera client from outside the firewall by using Secure Shell (SSH) on port 22 to access the Cloudera solution from the corporate network. Cloudera provides several interfaces that allow administrators and users to perform administration and data functions, depending on their roles and access level. Hadoop application programming interfaces (APIs) can be used to access data. Cloudera APIs can be used for cluster management and monitoring. Cloudera data services, management services, and other services run on the nodes in the cluster. Storage is a component of each data node in the cluster. Data can be incorporated into Cloudera Enterprise storage through the Hadoop APIs or network file system (NFS), depending on the needs of the customer.

A database is required to store the data for Cloudera Manager, the Hive metastore, and other services. Cloudera provides an embedded database for test or proof of concept (POC) environments; an external database is required for a supportable production environment.

Figure 1. Cloudera architecture overview

Bare-metal Cluster - Local and External SAS Storage (JBOD)

The big data cluster solutions described in this document can be deployed on bare-metal infrastructure. This means that both the management nodes and the data nodes are implemented on physical host servers. The number of servers of each type is determined based on requirements for high availability, total data capacity, and desired performance objectives. This reference architecture provides validated solutions for traditional local storage on the Lenovo SR650 as well as external storage using the Lenovo SD530 dense compute nodes with the Lenovo D3284 external SAS storage enclosure, configured for non-RAID JBOD (Just-a-Bunch-Of-Drives). The external storage solution gives over 40% more storage capacity per rack and more compute nodes compared with nodes that use internal HDD storage.

The cornerstone server for Cloudera big data clusters is the SR650, which offers the highest-performance selections of processor, memory, and storage. The SD530 with external D3284 SAS enclosure solution provides dense and optimized storage with the highest storage capacity per rack and the highest compute node count per rack. The SD530 solution also allows scaling up compute nodes separately from the storage nodes.

With Hadoop external SAS HDD storage, a separate 5U direct access SAS external storage enclosure is used with up to 6 dense SD530 compute nodes connected via SAS cabling. This allows upgrading compute nodes with new technology without impacting the storage nodes. Also, the dense form factors increase both the Cloudera rack storage and the compute node count by over 40% compared with nodes that use internal HDD storage. Scale-out of the external SAS enclosure works as usual for Hadoop: compute nodes are added one at a time, and a new SAS enclosure is added for every 6 compute nodes.

With Hadoop local storage, the SR650 server contains compute and storage in the same physical enclosure. Scale-out is accomplished by adding one or more nodes, which adds both compute and storage simultaneously to the cluster. The Lenovo SR650 2U node provides the highest CPU core count and highest total memory per node for a very high-end analytics solution.

Figure 2 gives a high-level view of external SAS storage versus internal SAS storage.

Figure 2. Hadoop local and external SAS storage topologies (panels: SR650 Internal SAS Storage; SD530 External Direct Attach SAS Storage)
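The scale-out rule for the external SAS topology (up to 6 SD530 compute nodes per D3284 enclosure) reduces to simple arithmetic. The following is a minimal planning sketch under that single assumption; the function names `enclosures_needed` and `plan_scale_out` are hypothetical helpers introduced here for illustration and are not part of any Lenovo or Cloudera tool. It ignores rack space, SAS cabling limits, and capacity targets, which the later sizing sections would need to cover.

```python
import math

# From the text: up to 6 SD530 compute nodes attach to one D3284 enclosure.
NODES_PER_ENCLOSURE = 6


def enclosures_needed(compute_nodes: int) -> int:
    """Return the minimum number of enclosures for a given compute node count."""
    if compute_nodes < 0:
        raise ValueError("compute_nodes must be non-negative")
    return math.ceil(compute_nodes / NODES_PER_ENCLOSURE)


def plan_scale_out(current_nodes: int, added_nodes: int) -> dict:
    """Summarize how many new enclosures a compute scale-out step requires."""
    before = enclosures_needed(current_nodes)
    after = enclosures_needed(current_nodes + added_nodes)
    return {
        "total_nodes": current_nodes + added_nodes,
        "enclosures_before": before,
        "enclosures_after": after,
        "new_enclosures": after - before,
    }


# Adding a seventh node to a full group of six calls for a second enclosure.
print(plan_scale_out(current_nodes=6, added_nodes=1))
```

By contrast, with SR650 local storage every added node brings its own drives, so no separate enclosure count needs to be tracked.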

Virtualized Cluster with VMware vSphere

When Hadoop is virtualized, all of the components of Hadoop, including the NameNode, ResourceManager, DataNode, and NodeManager, are running within purpose-built Virtual Machines (VMs) rather than on the native OS of the physical machine. However, the Hadoop services or roles of the Cloudera software stack are installed with Cloudera Manager exactly the same way as with the physical machines. With a virtualization infrastructure, two or more VMs can be run on the same physical host server to improve cluster usage efficiency and flexibility.

The VMware-based infrastructure with direct attached storage for HDFS is used to maintain the storage-to-CPU locality on a physical node. VMs are configured for one-to-one mapping of a physical disk to a vSphere VMFS virtual disk, as shown in Figure 3.

Figure 3. One-to-one mapping of local storage

5 Component Model

Cloudera Enterprise provides features and capabilities that meet the functional and non-functional requirements of customers. It supports mission-critical and real-time big data analytics across different industries, such as financial services, retail, media, healthcare, manufacturing, telecommunications, government organizations, and leading Fortune 100 and Web 2.0 companies.

Cloudera Enterprise is the world's most complete, tested, and popular distribution of Apache Hadoop and related projects. All of the packaging and integration work is done for you, and the entire solution is thoroughly tested and fully documented. By taking the guesswork out of building out your Hadoop deployment, Cloudera Enterprise gives you a streamlined path to success in solving real business problems with big data.

The Cloudera platform for big data can be used for various use cases, from batch applications that use MapReduce or Spark with data sources, such as click streams, to real-time applications that use sensor data.

Figure 4 shows the Cloudera Enterprise key capabilities that meet the functional requirements of customers.

Figure 4. Cloudera Enterprise key capabilities

Cloudera Components

The Cloudera Enterprise solution contains the following components:

• Analytic SQL: Apache Impala
  Impala is the industry's leading massively parallel processing (MPP) SQL query engine that runs natively in Hadoop. The Apache-licensed, open source Impala project combines modern, scalable parallel database technology with the power of Hadoop, enabling users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is designed from the ground up as part of the Hadoop system and shares the same flexible file and data formats, metadata, security, and resource management frameworks that are used by MapReduce, Apache Hive, Apache Pig, and other components of the Hadoop stack.

• Search Engine: Cloudera Search
  Cloudera Search is Apache Solr that is integrated with Cloudera Enterprise, including Apache Lucene, Apache SolrCloud, Apache Flume, Apache Tika, and Hadoop. Cloudera Search also includes valuable integrations that make searching more scalable, easy to use, and optimized for near-real-time and batch-oriented indexing. These integrations include Cloudera Morphlines, which is a customizable transformation chain that simplifies loading any type of data into Cloudera Search.

• NoSQL: HBase
  A scalable, distributed column-oriented datastore. HBase provides real-time read/write random access to very large datasets hosted on HDFS.

• Stream Processing: Apache Spark
  Apache Spark is an open source, parallel data processing framework that complements Hadoop to make it easy to develop fast, unified big data applications that combine batch, streaming, and interactive analytics on all your data. Cloudera offers commercial support for Spark with Cloudera Enterprise. Spark is 10 to 100 times faster than MapReduce, which delivers faster time to insight, allows inclusion of more data, and results in better business decisions and user outcomes.

• Machine Learning: Spark MLlib
  MLlib is the API that implements common machine learning algorithms. MLlib is usable in Java, Scala, Python and R. Leveraging Spark's excellence in iterative computation, MLlib runs very fast, high-quality algorithms.

• Cloudera Manager
  Cloudera Manager is the industry's first and most sophisticated management application for Hadoop and the enterprise data hub. Cloudera Manager sets the standard for enterprise deployment by delivering granular visibility into and control over every part of the data hub, which empowers operators to improve performance, enhance quality of service, increase compliance, and reduce administrative costs. Cloudera Manager makes administration of your enterprise data hub simple and straightforward, at any scale. With Cloudera Manager, you can easily deploy and centrally operate the complete big data stack.

  Cloudera Manager automates the installation process, which reduces deployment time from weeks to minutes; gives you a cluster-wide, real-time view of nodes and services running; provides a single, central console to enact configuration changes across your cluster; and incorporates a full range of re
