Seagate Achieves 60% Cost Savings After Transforming their Enterprise Hadoop using Spark and Presto on Amazon EMR
Seagate Technology LLC provides electronic data storage technologies and solutions. Its products and services include network attached storage, high-performance computing, data protection appliances, internal hard drives, backup and recovery services, flash storage, and related solutions. Seagate manufactures 55 to 60 million hard disk drives each quarter.
Seagate's factory data is generated by device testing and was processed on an Enterprise Hadoop Cluster (EHC). Before coming to Mactores, Seagate was using Hortonworks HDP (Hive, MR, Tez, and HDFS) on premises to analyze about two petabytes of data.
As part of its IT transformation, Seagate migrated the EHC platform to Amazon EMR in a lift-and-shift project. The lift and shift hurt both the cost and the performance of EHC on Amazon EMR because of its custom SerDe requirements and other dependencies on HDFS. The platform held about two petabytes of data, with 1,500 tables and 11,000 queries and jobs.
The Mactores Data Engineering team worked with Seagate’s EHC team to re-platform Seagate’s EHC from Hive on Tez to Spark (PySpark) for ETL and Presto for analytics. Seagate's ETL jobs were also converted to use Apache Airflow as the ETL manager and scheduler. Airflow made the ETL easier to manage and provided a framework for designing new ETL jobs. With cost-based optimization, Presto on Hive made analytical queries 20 to 30 times faster. Mactores used Snowball Edge to migrate the data from the on-premises Hortonworks cluster (HDFS) to S3, and then used AWS CLI S3 copy to upload incremental data from multiple factories.
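As an illustration of the incremental upload step, the sketch below builds one `aws s3 cp --recursive` command per factory for a single day of data. It is a minimal sketch under stated assumptions, not the commands Mactores ran: the bucket name, the local export root, the `ehc/` prefix, the factory names, and the `factory=/dt=` directory layout are all hypothetical.

```python
from datetime import date


def build_incremental_copy_command(factory, local_root, bucket, day):
    """Return the AWS CLI argument list that copies one factory's data for one day to S3.

    The factory=/dt= layout and the "ehc/" prefix are assumptions made for
    this example; the source does not describe the real directory structure.
    """
    partition = f"factory={factory}/dt={day.isoformat()}"
    source = f"{local_root.rstrip('/')}/{partition}/"
    target = f"s3://{bucket}/ehc/{partition}/"
    # "aws s3 cp <src> <dst> --recursive" copies every file under the source directory.
    return ["aws", "s3", "cp", source, target, "--recursive"]


# Hypothetical factories, export path, and bucket name.
commands = [
    build_incremental_copy_command(name, "/data/export", "example-ehc-bucket", date(2020, 1, 15))
    for name in ("factory-a", "factory-b")
]

for command in commands:
    # Only printed here; each list could be passed to subprocess.run(command, check=True).
    print(" ".join(command))
```

Copying one dated directory at a time keeps each upload limited to new data, which is one simple way to make the transfer incremental.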
Spark substantially improved ETL performance, which reduced the cost of operation. Spot Instances and Spot Fleet with auto-scaling were used to further optimize the cost of running EHC on Amazon EMR. Apache Superset was used for querying and exploration. All the data collected is used for machine learning with Spark ML and TensorFlow on EMR.
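A transient, Spot-based EMR cluster of the kind described above could be requested with a configuration along these lines. This is a hedged sketch: the cluster name, instance types, capacities, release label, and timeout values are assumptions chosen for illustration, auto-scaling is not shown, and the default role names may differ in a given account. The dictionary keys follow the EMR instance fleet request format.

```python
def spot_fleet(name, fleet_type, spot_units, instance_types):
    """Describe one EMR instance fleet that is filled entirely from Spot capacity."""
    return {
        "Name": name,
        "InstanceFleetType": fleet_type,  # "MASTER", "CORE", or "TASK"
        "TargetSpotCapacity": spot_units,
        "InstanceTypeConfigs": [
            {"InstanceType": instance_type, "WeightedCapacity": 1}
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # Assumed values: wait 20 minutes for Spot, then fall back to On-Demand.
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


def build_transient_cluster_config(name, core_units, task_units, instance_types):
    """Build a request dict for a cluster that terminates once its steps finish."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.29.0",  # placeholder release, not taken from the source
        "Applications": [{"Name": "Spark"}, {"Name": "Presto"}],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "InstanceFleets": [
                spot_fleet("master", "MASTER", 1, instance_types[:1]),
                spot_fleet("core", "CORE", core_units, instance_types),
                spot_fleet("task", "TASK", task_units, instance_types),
            ],
            # False makes the cluster transient: it shuts down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }


# Hypothetical cluster name, capacities, and instance types.
config = build_transient_cluster_config(
    "ehc-nightly-etl", core_units=4, task_units=16,
    instance_types=["r5.2xlarge", "r5.4xlarge"],
)
print([fleet["InstanceFleetType"] for fleet in config["Instances"]["InstanceFleets"]])
```

A dictionary like this could be passed to boto3 as `emr_client.run_job_flow(**config)`, typically with a `Steps` list added so the cluster has work to run before it terminates.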
After working with Mactores to complete the transformation, the Seagate EHC team achieved a 60% reduction in Amazon EMR operating costs and a 20 to 30x performance improvement. The Data Analytics and Sciences team can now scale any number of jobs up or down based on the budget and the importance of each job.
The robust ETL framework built on Airflow gave Seagate high agility in adding new tables. All ETL jobs are now metadata driven, which makes the ETL code 100% reusable. All clusters are transient and run on Spot Instances. With these improvements, Seagate will achieve approximately 60% cost savings, reducing costs from $2.2 million to $780,000.
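To show what "metadata driven" can mean in practice, the sketch below generates the SQL for each ETL job from a table description instead of hand-writing one script per table. It is an illustration only, not Seagate's framework: the metadata fields, table names, and column names are hypothetical.

```python
# Hypothetical metadata: one entry per table; adding a table means adding an entry.
TABLE_METADATA = [
    {
        "source": "raw.drive_test",
        "target": "curated.drive_test",
        "columns": ["serial_no", "test_ts", "result"],
        "partition_by": "dt",
    },
    {
        "source": "raw.head_test",
        "target": "curated.head_test",
        "columns": ["serial_no", "head_id", "score"],
        "partition_by": "dt",
    },
]


def render_etl_sql(meta):
    """Render a dynamic-partition INSERT OVERWRITE statement from one metadata entry."""
    # The partition column goes last in the SELECT list, as dynamic partitioning expects.
    select_list = ", ".join(meta["columns"] + [meta["partition_by"]])
    return (
        f"INSERT OVERWRITE TABLE {meta['target']} "
        f"PARTITION ({meta['partition_by']}) "
        f"SELECT {select_list} FROM {meta['source']}"
    )


statements = [render_etl_sql(meta) for meta in TABLE_METADATA]
for statement in statements:
    # In a Spark job each statement could be run with spark.sql(statement).
    print(statement)
```

Because the same function serves every table, the job code is reused unchanged and only the metadata grows as tables are added.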