23

I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS.

I don't know anything about AWS. If I deploy it there, which should I choose: Amazon EC2 or Amazon EMR?

I want to improve the performance of my task. Which one is better and more reliable for me? How should I approach them? I heard that we can also register our VM settings as-is on AWS. Is that possible?

Please advise as soon as possible.

Many Thanks.

Vzzarr
Bhavesh Shah

3 Answers

31

EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured on them. If you are using your cluster to run Hadoop/Hive/Pig jobs, EMR is the way to go. An EMR instance costs a little extra compared to an EC2 instance. A quick check of Amazon prices today shows that a small EC2 instance costs $0.08/hour, while a small EMR instance costs $0.015/hour extra. In my opinion, it's totally worth paying that extra money to save yourself the hassle of installing and setting up Hadoop (along with Hive and Pig), and of creating, maintaining, and using an AMI. Moreover, EMR's versions of Hadoop and Hive have some patches that are not available (at least, not yet) in Apache Hive. If you use EC2, you will probably be using Apache Hadoop and Hive (or maybe the Cloudera distributions) and wouldn't have access to those patches (like native support for S3, or commands like ALTER TABLE my_table RECOVER PARTITIONS).
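
To put the price difference in perspective, here is a rough sketch of the arithmetic using the two hourly figures quoted above. The cluster size and run schedule are made-up assumptions for illustration, and the prices are the ones from this answer, not current AWS pricing.

```python
from decimal import Decimal

# Hourly prices quoted in the answer above (not current AWS pricing).
EC2_SMALL_PER_HOUR = Decimal("0.08")
EMR_SURCHARGE_PER_HOUR = Decimal("0.015")


def cluster_cost(instances, hours, use_emr):
    """Total cost in dollars for `instances` small instances running for `hours` hours each."""
    rate = EC2_SMALL_PER_HOUR
    if use_emr:
        # EMR is billed as the EC2 price plus a per-instance surcharge.
        rate += EMR_SURCHARGE_PER_HOUR
    return rate * instances * hours


# Hypothetical workload (an assumption): 5 instances, 2 hours per run, 30 daily runs.
instances = 5
hours = 2 * 30

ec2_cost = cluster_cost(instances, hours, use_emr=False)
emr_cost = cluster_cost(instances, hours, use_emr=True)
premium = emr_cost - ec2_cost

print(f"EC2 only:    ${ec2_cost}")
print(f"EMR:         ${emr_cost}")
print(f"EMR premium: ${premium}")
```

With these assumed numbers the surcharge adds a few dollars per month, which is small next to the time needed to install and maintain Hadoop yourself.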

Mark Grover
6

I would suggest that you do NOT try to deploy your own Hadoop cluster unless you have 2-3 months to spare and a Hadoop expert handy.

Elastic MapReduce will let you get started very quickly by providing a pre-configured Hadoop environment. Since you only have a single job, it should be fine.

Matthew Rathbone
  • That's fine. In my use case, I want to use Sqoop to import data from MS SQL Server. I have created a job that processes it using Hive JDBC. But I have a huge amount of data in SQL Server (on the order of GBs). If I have to run the job on a daily/weekly basis, is it efficient to import from SQL Server each time? If I work around this by storing the data in S3, how can I link HDFS and S3? (Hive table data is stored in HDFS under the /user/hive/warehouse directory.) – Bhavesh Shah Apr 25 '12 at 05:26
2

In general, historically, EMR was pretty far behind the latest versions of Hadoop components, and some were missing entirely. That's the major reason for using another distribution. For example, if you wanted HBase, it wasn't in EMR, but now it is. Today, Spark is absent from EMR. EMR will generally lag.

That said, if you're not using the latest and greatest features, go with EMR.

pwy