All Things Big Data Cloud

Solution Blueprint for Big Data on the Cloud Proof-Of-Concept

AWS documentation is extremely detailed. So wherever possible, I am going to provide the links to aws documenation/tutorials instead of rewriting the stuff.
1. First and foremost we need to have AWS account, have the security details handy and create three S3 buckets (I created mine as bigdatapocinput, bigdatapocoutput and bigdatapocemrlogs). Please follow this detailed Video to get these done AWS Account setup and S3 bucket creation Before Proceeding further, you need to have these things:

An AWS account is created
Create a key pair
Have the details of AccessKey and SecurityKey
Three S3 buckets (one for input, one for output and one for EMR logs)

2. Next step is to make sure we have the latest version of Java SDK installed in our server. We will use eclipse to create S3 utility class, so install AWS SDK for eclipse. Here is a very detailed video to do that.AWS SDK for Java
3. Create a new project using sample S3 project

AWS Java Project

4. Make sure the AWSCredentials.properties in the classpath has the right details of accessKey and secretKey
5. Delete the S3Sample Java class that comes with the template project and create a new one (S3Utility), simplified with the below code.

                
import java.io.File;
                
import java.io.IOException;
                
import com.amazonaws.AmazonClientException;
                
import com.amazonaws.AmazonServiceException;
                
import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
                
import com.amazonaws.services.s3.AmazonS3;
                
import com.amazonaws.services.s3.AmazonS3Client;
                
import com.amazonaws.services.s3.model.PutObjectRequest;
                
public class S3Utility {
                
 public static void main(String[] args) throws IOException {
                
 AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());
                
 String bucketName = args[0];
                
 String key = args[1];
                
 File inputfile = new File(args[2]);
                
 try {
                
 s3.putObject(new PutObjectRequest(bucketName, key, inputfile));
                
 } catch (AmazonServiceException ase) {
                
 System.out.println(ase.getMessage() + ase.getStatusCode() + ase.getErrorCode() +
                ase.getErrorType() + ase.getRequestId());
                
 } catch (AmazonClientException ace) {
                
 System.out.println(ace.getMessage());
                
 }
                
 }
                
}

The project should look like this.

AWS Java Project

6. Export the project as a Runnable JAR file.
7. Get our Data source file (EOD Data) and unzip to the input folder. The path is Bhav copy from BSE – End of the Day file
8. Let's run the S3Utility as the executable JAR is ready, as shown above, three parameters are needed (in order) S3 Bucket name, S3 object key (use the file name) and file path in the local server. As mentioned earlier, I have already downloaded the End-of-Day file, unzipped and stored it in input folder.

So, my execution command is:

                
praveen@localhost:-$ java -jar bigdatapoc.jar bigdatapocinput/eoddata EQ121113.CSV EQ121113.CSV

9. Schedule this command through a shell script to execute everyday/every hour etc., based on your input frequency for the Analytics. In my demonstration, I am scheduling this for every evening at 06:00 PM
10.Now, let's create a AWS Data Pipeline to Lauch an EMR and run analytics.

Previous Steps Next Steps