Spark - RDD Creation
In this blog we will see how to create a standalone Java application that runs on a Spark cluster, and we will also learn how to use the spark-submit tool.
We will load a text file from a given path and count the number of records. You can create an RDD by parallelizing a collection or by loading an external file. You can also create an RDD by transforming an existing RDD.
package net.icircuit.spark.RDDCreation;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class App {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDDCreationExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // create an RDD by parallelizing an in-memory collection
        JavaRDD<String> parNames = sc.parallelize(Arrays.asList("sankar", "anshu"));

        // create an RDD by loading an external text file
        System.out.println("Loading text file from " + args[0]);
        JavaRDD<String> namesFromFile = sc.textFile(args[0]);

        System.out.println("parNames Count " + parNames.count());
        System.out.println("namesFromFile Count " + namesFromFile.count());

        // close the context
        sc.close();
    }
}
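The third way, deriving an RDD from an existing one, is done through transformations such as map or filter. Here is a minimal sketch building on the parNames RDD from the code above (the upperNames name is ours, for illustration):

// create an RDD by transforming an existing RDD;
// transformations are lazy and only execute when an action (e.g. count) runs
JavaRDD<String> upperNames = parNames.map(name -> name.toUpperCase());
System.out.println("upperNames Count " + upperNames.count());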
You can get the pom file from here.
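If you are writing the pom yourself, the key piece is the spark-core dependency, marked provided so the cluster's own Spark jars are used at runtime. The artifact suffix and version below are assumptions; match them to the Scala and Spark versions on your cluster:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <!-- assumption: pick the version running on your cluster -->
    <version>2.4.0</version>
    <scope>provided</scope>
</dependency>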
To submit the jar, run the following command:
spark-submit --class net.icircuit.spark.RDDCreation.App RDDCreation-0.0.1-SNAPSHOT.jar file:///home/anshumanthesniper7722/test.txt
By default spark-submit runs the job in local mode; to run the job on YARN you need to add the --master flag:
spark-submit --master yarn --class net.icircuit.spark.RDDCreation.App RDDCreation-0.0.1-SNAPSHOT.jar /path/to/file/on/hdfs
spark-submit
spark-submit is the single tool for submitting jobs to a Spark cluster.
If you run a script or jar file without any options, Spark runs the job locally:
spark-submit myJob.jar
To run the job on a cluster, you need to add the --master flag. The general form of the command is:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]
--class: The entry point for your application
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077, or just yarn in case you want to use YARN)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (see the worked example after this list)
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes
application-arguments: Arguments passed to the main method of your main class, if any
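As a worked example, here is how the template above might be filled in for this blog's job. The deploy mode and the spark.executor.memory value are assumptions chosen for illustration, and the input path is a placeholder:

./bin/spark-submit \
  --class net.icircuit.spark.RDDCreation.App \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.executor.memory=2g" \
  RDDCreation-0.0.1-SNAPSHOT.jar \
  /path/to/file/on/hdfs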
For more information on the spark-submit tool, check this page.