Calling Scalding from inside your application
Starting in Scalding 0.12, there is a clear API for doing this: Execution[T], which describes a set of map/reduce operations that, when executed, return a Future[T]. See the scaladocs for Execution. Below is an example.
import com.twitter.scalding._
import org.apache.hadoop.mapred.JobConf
import scala.util.Try

val job: Execution[Unit] =
  TypedPipe.from(TextLine("input"))
    .flatMap(_.split("\\s+"))
    .map { word => (word, 1L) }
    .sumByKey
    .writeExecution(TypedTsv("output"))

// Now we run it in Local mode (waitFor blocks and returns a Try)
val localResult: Try[Unit] = job.waitFor(Config.default, Local(true))

// Or for Hadoop:
val jobConf = new JobConf
val hadoopResult: Try[Unit] = job.waitFor(Config.hadoopWithDefaults(jobConf), Hdfs(true, jobConf))

// If you want to be asynchronous, use run instead of waitFor and get a Future in return

For testing or cases where you aggregate data down to a manageable level, .toIterableExecution on TypedPipe is very useful:
val job: Execution[Iterable[(String, Long)]] =
  TypedPipe.from(TextLine("input"))
    .flatMap(_.split("\\s+"))
    .map { word => (word, 1L) }
    .sumByKey
    .toIterableExecution

// Now we run it in Local mode; waitFor returns a Try, so .get rethrows any failure
val counts: Map[String, Long] = job.waitFor(Config.default, Local(true)).get.toMap

To run an Execution as a stand-alone job, see:
- ExecutionApp: make an object MyExJob extends ExecutionApp for a job you can run like a normal Java application (by using java on the class name).
- ExecutionJob: use this only if you have existing tooling around launching scalding.Job subclasses.
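As a minimal sketch of the ExecutionApp option, the word count from above can be wrapped in an object. The name MyExJob and the hard-coded "input" and "output" paths are placeholders carried over from the earlier example, not anything the library requires.

```scala
import com.twitter.scalding._

// Hypothetical stand-alone job; ExecutionApp supplies the main method,
// so this object only has to describe the work as an Execution[Unit].
object MyExJob extends ExecutionApp {
  def job: Execution[Unit] =
    TypedPipe.from(TextLine("input"))            // placeholder input path
      .flatMap(_.split("\\s+"))                  // split each line into words
      .map { word => (word, 1L) }
      .sumByKey                                  // total count per word
      .writeExecution(TypedTsv[(String, Long)]("output")) // placeholder output path
}
```

Note that the object never calls .waitFor or .run itself; running the Execution is left to ExecutionApp.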
Some rules
- When using Execution, NEVER use .write or .toPipe (or call any method that takes an implicit flowDef). Instead use .writeExecution, .toIterableExecution, or .forceToDiskExecution (see scaladocs).
- Avoid calling .waitFor or .run AS LONG AS POSSIBLE. Try to compose your entire job into one large Execution, using .zip or .flatMap to combine Executions. .waitFor is the same as .run except that it waits on the future. There should be at most one call to .waitFor or .run in each Execution App/Job.
- Only mutate vars or perform side effects using .onComplete. If you run the result of onComplete, the function you pass will be run when the result up to that point is available, and it will receive the Try[T] for the result. Avoid this if possible. It is here to deal with external IO or existing APIs, and is designed for experts who are comfortable using .onComplete on Scala Futures (which is all this method is doing under the covers).
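The first two rules can be sketched as follows. This is an illustration under assumptions rather than a canonical pattern: the object name ComposedExample, the helper wordCounts, and the "distinct_count" output path are made up for the example. Each step is only described as an Execution, the steps are combined with .zip and .flatMap, and .waitFor is called exactly once at the end.

```scala
import com.twitter.scalding._
import scala.util.Try

object ComposedExample {
  // Hypothetical helper: word counts for the text lines under `path`.
  def wordCounts(path: String): TypedPipe[(String, Long)] =
    TypedPipe.from(TextLine(path))
      .flatMap(_.split("\\s+"))
      .map { word => (word, 1L) }
      .sumByKey
      .toTypedPipe

  // Two independent steps, described but not yet run.
  // Note .writeExecution / .toIterableExecution rather than .write / .toPipe.
  val written: Execution[Unit] =
    wordCounts("input").writeExecution(TypedTsv[(String, Long)]("output"))
  val inMemory: Execution[Iterable[(String, Long)]] =
    wordCounts("input").toIterableExecution

  // zip combines independent Executions into one.
  val distinctWords: Execution[Int] =
    written.zip(inMemory).map { case (_, counts) => counts.size }

  // flatMap chains a step that depends on an earlier result.
  val report: Execution[Unit] =
    distinctWords.flatMap { n =>
      TypedPipe.from(List(n)).writeExecution(TypedTsv[Int]("distinct_count"))
    }

  def main(args: Array[String]): Unit = {
    // The single place where the composed Execution is actually run.
    val result: Try[Unit] = report.waitFor(Config.default, Local(true))
    result.get // rethrow if any step failed
  }
}
```

Keeping the single .waitFor at the outermost level means every step stays composable; any of the intermediate values could be reused inside a larger Execution without being run twice by hand.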
Running Existing Jobs Inside A Library
We recommend the above approach to build composable jobs with Executions. But if you have an existing Job, you can also run that:
Working example:
WordCountJob.scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

Runner.scala
import com.twitter.scalding._
import org.apache.hadoop.conf.Configuration

object Runner extends App {
  // Point the Hadoop configuration at the cluster
  val hadoopConfiguration: Configuration = new Configuration
  hadoopConfiguration.set("mapred.job.tracker", "hadoop-master:8021")
  hadoopConfiguration.set("fs.defaultFS", "hdfs://hadoop-master:8020")
  val hdfsMode = Hdfs(strict = true, hadoopConfiguration)
  val arguments = Mode.putMode(hdfsMode, Args("--input in.txt --output counts.tsv"))
  // Now create the job after the mode is set up properly.
  val job: WordCountJob = new WordCountJob(arguments)
  val flow = job.buildFlow
  flow.complete()
}

And then you can run your App on any server that has access to the Hadoop cluster.
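For trying the same runner without a cluster, a Local-mode variant is one option. This is a sketch under assumptions: the object name LocalRunner is hypothetical, and it assumes the WordCountJob class above is on the classpath and that in.txt exists in the working directory.

```scala
import com.twitter.scalding._

// Hypothetical local-mode counterpart of Runner, e.g. for a quick check on a laptop.
object LocalRunner extends App {
  // Local(true) uses strict sources, mirroring Hdfs(strict = true, ...) above
  val localMode = Local(true)
  val arguments = Mode.putMode(localMode, Args("--input in.txt --output counts.tsv"))
  // As before, create the job only after the mode has been attached to the args.
  val job: WordCountJob = new WordCountJob(arguments)
  val flow = job.buildFlow
  flow.complete()
}
```

Only the mode changes between the two runners; the job class and its arguments stay the same.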