Introduction to Matrix Library
Matrix.scala is a Scalding library that lets you treat pipes as sparse matrices and operate on them with standard matrix operations, such as matrix multiplication:

    // Computing the inner product of matrix A
    val innerProd = A * A.transpose

The matrix constructor takes a pipe of triples with the assumed semantics (row index, column index, matrix value). Additionally, the user can specify the approximate dimensions of the matrix (number of rows, columns, and non-zero values) and its skewness (whether the distribution of values over the row/column keys is skewed). This additional information can help speed up the computation and improve the scalability of the resulting job.
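To make the triple semantics concrete, here is a minimal plain-Scala sketch of a sparse matrix built from (row index, column index, value) triples and multiplied by its own transpose. It runs on in-memory collections and does not use Scalding; the names `SparseMatrix`, `fromTriples`, `transpose`, and `multiply` are illustrative assumptions, not part of the Matrix API.

```scala
// Plain-Scala illustration only; none of these names come from the Scalding Matrix API.
object SparseTriplesSketch {
  // A sparse matrix as (row, column) -> value; absent entries are treated as zero.
  type SparseMatrix = Map[(Int, Int), Double]

  // Build a matrix from (row index, column index, value) triples.
  def fromTriples(triples: Seq[(Int, Int, Double)]): SparseMatrix =
    triples.map { case (r, c, v) => ((r, c), v) }.toMap

  // Swap the row and column index of every entry.
  def transpose(m: SparseMatrix): SparseMatrix =
    m.map { case ((r, c), v) => ((c, r), v) }

  // Join on the shared index k (columns of a, rows of b),
  // then sum the products for each output cell (i, j).
  def multiply(a: SparseMatrix, b: SparseMatrix): SparseMatrix = {
    val bByRow = b.toSeq.groupBy { case ((k, _), _) => k }
    val products = for {
      ((i, k), av) <- a.toSeq
      ((_, j), bv) <- bByRow.getOrElse(k, Seq.empty)
    } yield ((i, j), av * bv)
    products.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }

  // A = [[1, 2], [0, 3]]; the zero entry is simply not stored.
  val a: SparseMatrix = fromTriples(Seq((0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)))
  val innerProd: SparseMatrix = multiply(a, transpose(a))
}
```

The join-then-sum shape of `multiply` is one common way to express sparse multiplication over key-value data; whether the library uses the same strategy internally is not stated here.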
Type restrictions
The matrix row and column indexes can be of any type that is comparable. The usual cases are Int, Long, String. This means that labeled matrices are allowed. For example, we can create a matrix containing the number of users that like specific movie genres per geo without first reindexing the categorical fields to numerical ids:
    // Loading the number of users interested in movie genres per geo from a Tsv source
    val interestsMatrix = Tsv( args("input") ).read.toMatrix[String,String,Long]('geo, 'movie_genre, 'freq)

The value type determines which operations can be applied to the matrix:
- Minimally, in order to support addition, the value type T has to have the trait Monoid[T] (as defined in algebird/Monoid).
- In order to support subtraction, the value type T has to have the trait Group[T] (algebird/Group).
- For multiplication, the value type T has to have the trait Ring[T] (algebird/Ring: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala).
- For division, the value type T has to have the trait Field[T] (algebird/Field).
This approach is more powerful than requiring the value type to be Numeric, because all four operations (addition, subtraction, multiplication, and division) can then be extended to non-numeric types. For example:
- String addition can be defined as string concatenation
- List addition can be defined as list concatenation
- Set addition can be defined as set union
These operations can be stacked together. For example, Map addition can be defined as the union of the key sets, with the values of keys that appear in both Maps aggregated using their own definition of addition. By allowing matrix values to be structured types, we can work with higher-order tensors such as cubes or four-tensors with the same library.
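The way these definitions of addition compose can be sketched with a small hand-rolled type class. This is a plain-Scala stand-in for the idea behind a monoid, not the Algebird API: `SimpleMonoid`, the instance names, and `sum` are all hypothetical names introduced for illustration.

```scala
// Hand-rolled illustration; SimpleMonoid and its instances are hypothetical,
// not the Algebird Monoid trait.
object MonoidSketch {
  trait SimpleMonoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  implicit val longAddition: SimpleMonoid[Long] = new SimpleMonoid[Long] {
    val zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  // String addition as concatenation.
  implicit val stringConcat: SimpleMonoid[String] = new SimpleMonoid[String] {
    val zero: String = ""
    def plus(a: String, b: String): String = a + b
  }

  // Set addition as set union.
  implicit def setUnion[A]: SimpleMonoid[Set[A]] = new SimpleMonoid[Set[A]] {
    val zero: Set[A] = Set.empty[A]
    def plus(a: Set[A], b: Set[A]): Set[A] = a.union(b)
  }

  // Map addition: union of the keys; values under a shared key are added
  // with the value type's own monoid, which is what lets definitions stack.
  implicit def mapAddition[K, V](implicit vm: SimpleMonoid[V]): SimpleMonoid[Map[K, V]] =
    new SimpleMonoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty[K, V]
      def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
        b.foldLeft(a) { case (acc, (k, v)) =>
          acc.updated(k, acc.get(k).map(vm.plus(_, v)).getOrElse(v))
        }
    }

  // Fold a sequence with whichever monoid is in scope for T.
  def sum[T](xs: Seq[T])(implicit m: SimpleMonoid[T]): T = xs.foldLeft(m.zero)(m.plus)

  // Map values added as Longs.
  val genres: Map[String, Long] =
    sum(Seq(Map("drama" -> 2L), Map("drama" -> 3L, "comedy" -> 1L)))

  // Map values added as Sets: the Map monoid delegates to the Set monoid.
  val nested: Map[String, Set[Int]] =
    sum(Seq(Map("a" -> Set(1)), Map("a" -> Set(2), "b" -> Set(3))))
}
```

Because the Map instance only asks for some monoid on its values, the same `sum` works for Long-valued and Set-valued maps without any extra code.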
For more information on algebraic structures, see Algebraic structure (Wikipedia) and the pages on Monoid, Group, Ring, and Field.
Getting Started. The "Hello World!" example for the Matrix library
Graph nodes outdegrees
Graphs have a straightforward representation in the Matrix library as adjacency matrices. We will use the library to compute the outdegrees of the nodes in a graph.
    package com.twitter.scalding.examples

    import com.twitter.scalding._
    import com.twitter.scalding.mathematics.Matrix

    class GraphOutDegreeJob(args : Args) extends Job(args) {
      import Matrix._

      val adjacencyMatrix = Tsv( args("input"), ('user1, 'user2, 'rel) )
        .read
        .toMatrix[Long,Long,Double]('user1, 'user2, 'rel)

      // each row i represents all of the outgoing edges from i
      // by summing out all of the columns we get the outdegree of i
      adjacencyMatrix.sumColVectors.write( Tsv( args("output") ) )
    }

We convert a pipe of triples to a sparse matrix where element[i,j] represents an edge between row[i] and column[j]. We then sum the columns together into a column vector that has the outdegree of node[i] at row[i].
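To see what the job computes without running Hadoop, the same row-wise sum can be sketched on an in-memory edge list. This is plain Scala on a made-up three-edge graph, assuming each `rel` value is an edge weight of 1.0; `OutDegreeSketch`, `edges`, and `outDegrees` are illustrative names, not part of Scalding.

```scala
// Plain-Scala illustration of the row-wise sum; not Scalding code.
object OutDegreeSketch {
  // (user1, user2, rel) triples: an edge from user1 to user2 with weight rel.
  // The graph below is made-up sample data.
  val edges: Seq[(Long, Long, Double)] = Seq(
    (1L, 2L, 1.0),
    (1L, 3L, 1.0),
    (2L, 3L, 1.0)
  )

  // Group the triples by row index and sum the values in each row.
  // Nodes with no outgoing edges have no row and therefore no entry.
  val outDegrees: Map[Long, Double] =
    edges.groupBy(_._1).map { case (node, row) => node -> row.map(_._3).sum }
}
```

Node 3 has only incoming edges, so it does not appear in the result, which mirrors how a sparse representation omits zero entries.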
Next steps
- Read the Matrix API Reference: includes code snippets explaining different kinds of matrix functions (e.g., sumRowVectors, matrix product, element-wise product, diagonal, topRowElems) and much more.
- Go over the Matrix tutorials: the tutorials range from one-liners to more complex examples that show real applications of the Matrix functions to graph problems and text processing.