Spark - RDD Creation
In this blog we will see how to create a standalone Java application that runs on a Spark cluster, and we will also learn how to use the spark-submit tool.
We will load a text file from a given path and count the number of records. You can create an RDD by parallelizing a collection or by loading an external file. You can also create an RDD by transforming an existing RDD.
package net.icircuit.spark.RDDCreation;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class App {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDDCreationExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // create an RDD by parallelizing an in-memory collection
        JavaRDD<String> parNames = sc.parallelize(Arrays.asList("sankar", "anshu"));

        // create an RDD by loading an external text file
        System.out.println("Loading text file from " + args[0]);
        JavaRDD<String> namesFromFile = sc.textFile(args[0]);

        System.out.println("parNames Count " + parNames.count());
        System.out.println("namesFromFile Count " + namesFromFile.count());

        // close the context
        sc.close();
    }
}
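The third way, deriving an RDD from an existing one, is done through transformations such as map or filter. Here is a minimal sketch building on the parNames RDD from the code above (the upperNames name is ours, for illustration):

// create an RDD by transforming an existing RDD;
// transformations are lazy and only execute when an action (e.g. count) runs
JavaRDD<String> upperNames = parNames.map(name -> name.toUpperCase());
System.out.println("upperNames Count " + upperNames.count());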
You can get the pom file from here.
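If you are writing the pom yourself, the key piece is the spark-core dependency, marked provided so the cluster's own Spark jars are used at runtime. The artifact suffix and version below are assumptions; match them to the Scala and Spark versions on your cluster:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <!-- assumption: pick the version running on your cluster -->
    <version>2.4.0</version>
    <scope>provided</scope>
</dependency>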
To submit the jar, run the following command:
spark-submit --class net.icircuit.spark.RDDCreation.App RDDCreation-0.0.1-SNAPSHOT.jar file:///home/anshumanthesniper7722/test.txt
By default spark-submit runs the job in local mode; to run the job on YARN you need to add the --master flag:
spark-submit --master yarn --class net.icircuit.spark.RDDCreation.App RDDCreation-0.0.1-SNAPSHOT.jar /path/to/file/on/hdfs
spark-submit
spark-submit is the single tool for submitting jobs to a Spark cluster.
If you run a script or jar file without any options, Spark runs the job locally:
spark-submit myJob.jar
To run the job on a cluster, you need to add the --master flag. The general form of the command is:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]
--class: The entry point for your application
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077, or just yarn in case you want to use YARN)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (see the worked example after this list)
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes
application-arguments: Arguments passed to the main method of your main class, if any
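As a worked example, here is how the template above might be filled in for this blog's job. The deploy mode and the spark.executor.memory value are assumptions chosen for illustration, and the input path is a placeholder:

./bin/spark-submit \
  --class net.icircuit.spark.RDDCreation.App \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.executor.memory=2g" \
  RDDCreation-0.0.1-SNAPSHOT.jar \
  /path/to/file/on/hdfs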
For more information on the spark-submit tool, check this page.