Big Data Is So Big You Have to Rethink It
What are the current market trends you see shaping the Hadoop space, and what is your take on incorporating those trends and making them effective through your solutions?
Most recently, as we speak with customers about their big data deployments, we are hearing that they need solutions addressing both their storage and analytics requirements. Growth in the analytics tier of the ecosystem, and indeed non-linear growth in both the data and analytics aspects driven by new use cases, is creating issues for Hadoop at scale.
On the “Big Data” front, HDFS is becoming the active archive target for traditional EDW and analytics platforms (e.g. HANA). Driven by technologies such as SAP Vora, which allow aggregated querying between a hot analytics/EDW tier and a cold HDFS storage tier, organizations are looking to keep years of transactional data and ETL history online. We’re also seeing growth in the requirement for lower-latency, higher-performance storage tiers with technologies such as Kudu. Not all Big Data is the same; most organizations are wrestling with creating a data repository which can handle multiple data temperatures and latencies.
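The idea of multiple data temperatures can be made concrete with a simple age-based placement policy. The thresholds, tier names, and function below are illustrative assumptions for this sketch, not a specific HPE, HANA, or Kudu configuration:

```python
from datetime import datetime, timedelta

# Hypothetical tiering thresholds (assumptions, not product defaults).
HOT_WINDOW = timedelta(days=90)        # recent data: in-memory / EDW tier
WARM_WINDOW = timedelta(days=365 * 2)  # older data: low-latency storage tier

def assign_tier(last_access, now=None):
    """Pick a storage tier for a record based on how recently it was accessed."""
    now = now or datetime.utcnow()
    age = now - last_access
    if age <= HOT_WINDOW:
        return "hot"    # analytics / EDW tier (e.g. HANA)
    if age <= WARM_WINDOW:
        return "warm"   # lower-latency storage tier (e.g. Kudu)
    return "cold"       # HDFS / object-store active archive

now = datetime(2024, 1, 1)
assert assign_tier(datetime(2023, 12, 1), now) == "hot"
assert assign_tier(datetime(2023, 1, 1), now) == "warm"
assert assign_tier(datetime(2018, 1, 1), now) == "cold"
```

In a real deployment the placement decision would also weigh access frequency and query latency requirements, not just age, but the same hot/warm/cold split applies.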
On the analytics front, it’s not just about data at rest. Analytics on data-in-motion, in-memory and from low-latency NoSQL stores is critical to respond to the real-time demands being placed on big analytics workloads. Architectures such as SMACK are now being considered as alternatives to the traditional Hadoop stack, or in concert with Hadoop, with HDFS and object storage being targeted for archival purposes. Spark adoption is growing rapidly. Machine learning use cases are growing, capitalizing on hardware-based offload engines (GPUs, ASICs) and in-memory architectures such as Spark. We see this in multiple industries: oil and gas, automotive (connected cars) and retail (e.g. “Amazon Go”) all rely upon processing and acting upon video/sensor data in real time.
Last, but not least, operational concerns are paramount for organizations that now have Hadoop deployed in several pockets/silos around their data center but don’t have a rationalized Big Data strategy. Managing the lifecycle and addressing traditional IT operational concerns (security, version control, non-disruptive updates, data availability for mission-critical data, etc.) are primary considerations.
What are we doing at HPE? Well, we provide a rich portfolio of platforms optimized for various workloads, from industry-standard rackmount platforms such as the ProLiant DL360 and DL380, to our Apollo family of density-optimized platforms targeted at handling both the storage needs of all data temperatures and the compute/memory needs of Big Analytics, as well as IoT-optimized platforms in our Edgeline family. Further, our Elastic Platform for Analytics features both traditional and disaggregated asymmetric reference architectures which allow independent scaling of compute and storage resources, optimizing rack density and power and helping to eliminate cluster sprawl while providing workload isolation. We also have tools and the experience of managing large HPC clusters, and these tools (such as HPE Insight CMU) provide benefits which directly translate to managing the infrastructural elements of Big Data clusters.
Nowadays, a lot of hype is forming around NoSQL technology, and both emerging players and the big fish in the market are promoting its benefits. What are the advantages of using NoSQL databases for an enterprise? Any thoughts on this?
Used in the right way, NoSQL can be very powerful, but like any technology it has pros and cons. NoSQL databases are enabling new types of applications due to their extreme scale and cost effectiveness. Five years ago, one of the biggest applications in the world was a stock exchange processing five million trades a day; as a comparison, our team recently showed four million events per second being processed by HBase on a microcluster that fits within eight inches of rack space. This means we are now using events that would have been thrown away five years ago to drive decision making as the events occur. NoSQL excels where an application needs to scale in one dimension, such as sensor events or shopping baskets, but it might not be the best tool for storing complex relationships. The shops that have found themselves in trouble are the ones that have tried to use NoSQL where complex relationships must be maintained, or that have allowed the unstructured nature of a NoSQL store to become a dumping ground where it is difficult to find important relationships.
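Scaling in one dimension comes down largely to key design. The sketch below is a toy, in-memory illustration of the HBase-style pattern of composite row keys (entity id plus reversed timestamp) so that time-series events for one sensor cluster together and the newest sorts first; the class, key format, and sensor names are assumptions for illustration, not the system described above:

```python
from bisect import insort, bisect_left

MAX_TS = 10**10  # assumed upper bound used to reverse timestamps

def row_key(sensor_id, ts):
    # Reversed timestamp: the newest event sorts first within a sensor.
    return f"{sensor_id}#{MAX_TS - ts:010d}"

class TinyStore:
    """Toy sorted key-value store mimicking HBase's lexicographic row order."""

    def __init__(self):
        self._rows = []  # kept sorted by key via insort

    def put(self, key, value):
        insort(self._rows, (key, value))

    def scan(self, prefix, limit=10):
        """Return up to `limit` rows whose key starts with `prefix`."""
        keys = [k for k, _ in self._rows]  # O(n) here; an index in a real store
        out = []
        for k, v in self._rows[bisect_left(keys, prefix):]:
            if not k.startswith(prefix) or len(out) == limit:
                break
            out.append((k, v))
        return out

store = TinyStore()
store.put(row_key("sensor-42", 1000), {"temp": 20.1})
store.put(row_key("sensor-42", 2000), {"temp": 20.7})
store.put(row_key("sensor-07", 1500), {"temp": 19.4})

# Latest reading for one sensor: a single short prefix scan, no join needed.
latest = store.scan("sensor-42#", limit=1)
assert latest[0][1] == {"temp": 20.7}
```

The same design shows the flip side: answering “which sensors influenced each other?” would require scanning everything, which is exactly the complex-relationship workload where NoSQL struggles.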
What is your take on ensuring data availability?
Big data is so big you have to rethink it. It simply becomes a physics problem to manipulate big data the way you do in a typical system. Attempting to back up a big data system that is designed for massive ingest will likely prove that the backup can never keep up with the ingest, so we move to a strategy of maintaining data availability via replication. When you hold vast amounts of data in the memory of a Spark cluster, however, availability must be provided by a more cost-effective means than replication. So what we are seeing going forward is that the data used in big data systems will bifurcate into a hot tier and a cold tier. Software projects are learning to use byte-addressable, non-volatile memory for the hot tier, for things like Spark’s working data. That data will typically be persisted in a longer-term cold storage tier hosted on object stores and HDFS, which will increasingly use techniques like erasure coding to maintain data availability (such as HPE’s offering in partnership with Scality). We believe that the platforms needed for this new model are different from the types of systems that vendors have built in the past. We now offer very storage-dense servers with just the right silicon to efficiently host large object stores, and at the other end we offer servers that are extremely compute-, memory- and NVM-dense for the hot analytic tier.
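To see why erasure coding is attractive for the cold tier, consider the simplest possible code: one XOR parity block over k data blocks (RAID-5-style), which tolerates a single lost block at (k+1)/k storage overhead instead of the 3x overhead of triple replication. This is only a toy sketch; production systems such as HDFS erasure coding and Scality use Reed-Solomon codes that survive multiple failures:

```python
from functools import reduce

def parity(blocks):
    """XOR together equal-length blocks to form a single parity block."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def recover(surviving, parity_block):
    """Rebuild the one missing data block: XOR of survivors plus parity."""
    return parity([*surviving, parity_block])

data = [b"hot-", b"cold", b"tier", b"data"]  # k = 4 data blocks
p = parity(data)

# Simulate losing block 2 and rebuilding it from the rest:
rebuilt = recover(data[:2] + data[3:], p)
assert rebuilt == b"tier"

# Overhead: (k + 1) / k = 1.25x stored bytes here, vs 3x for replication.
```

The trade-off is that recovery requires reading and recombining the surviving blocks rather than simply copying a replica, which is why erasure coding suits the cold archive tier better than the hot, latency-sensitive tier.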