Some organizations may need or prefer to install libraries locally rather than fetch them from Maven. This can make it easier to provide a common configuration to multiple local users. To accomplish this, you'll need to generate an “assembly” using the Mothra build scripts in a location with Internet access. You can do this by running ./mill show 'mothra.cross[2.12.20].assembly', and then making a copy of the output jar file produced, using a name like mothra_2.12-1.7.0-full.jar. (Use whatever version numbers are appropriate for your installation.)
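For example, after running the Mill command above, you might copy the jar it reports to a predictably named file. The source path below is hypothetical; use whatever path Mill actually prints on your system:
$ cp out/mothra/cross/2.12.20/assembly.dest/out.jar mothra_2.12-1.7.0-full.jar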
The easiest way to run the spark-shell CLI interactive tool using the Mothra libraries is to include the mothra_2.12-1.7.0-full.jar file in the --jars option to spark-shell:
$ spark-shell --jars path/to/mothra_2.12-1.7.0-full.jar
If you have additional jars for this list, they should be included in a single --jars option, separated by commas, like:
$ spark-shell --jars path/to/first.jar,path/to/second.jar
The --jars switch may also be used with the spark-submit command for non-interactive jobs that need the Mothra libraries.
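For example, a minimal sketch of such a submission, where the application jar and main class are hypothetical placeholders for your own job:
$ spark-submit --class com.example.MyJob --jars path/to/mothra_2.12-1.7.0-full.jar path/to/my-job.jar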
When you use --jars, Spark will automatically copy the jar files to every node that needs them for the job (for either spark-shell or spark-submit).
You can also make a version of Mothra available as a system-wide default by editing your spark-defaults.conf file to include:
spark.driver.extraClassPath /full/path/to/mothra_2.12-1.7.0-full.jar
spark.executor.extraClassPath /full/path/to/mothra_2.12-1.7.0-full.jar
Unlike the command-line --jars option, these values are classpath strings whose entries are separated by the platform's path separator (':' on Linux and macOS), and you'll want to preserve any existing contents.
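For example, if your spark-defaults.conf already contains classpath entries, append the Mothra jar rather than replacing them (the existing jar shown here is a hypothetical placeholder):
spark.driver.extraClassPath /existing/path/to/other.jar:/full/path/to/mothra_2.12-1.7.0-full.jar
spark.executor.extraClassPath /existing/path/to/other.jar:/full/path/to/mothra_2.12-1.7.0-full.jar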
When using spark-defaults.conf, Spark will not automatically copy the jar file for you. You must make sure that the jar file is present at the same expected path on every machine that might need it. (This includes any machine where either a Spark driver or a Spark executor might be run.)
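One way to do this is to copy the jar to the same location on every node; the host names below are hypothetical, and a shared filesystem visible to all nodes works equally well:
$ for host in node1 node2 node3; do scp /full/path/to/mothra_2.12-1.7.0-full.jar ${host}:/full/path/to/; done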
Note that while making Mothra available system-wide in a Spark installation should cause everything that uses that Spark installation to work with Mothra, some notebook server configurations may by default use their own Spark installation. Consult the documentation for your notebook server to determine how to configure it to use the system-wide Spark installation, or how to configure its built-in Spark installation to use the locally installed Mothra libraries.
These so-called "full" jars contain embedded dependencies so that no additional files must be downloaded to use Mothra. It is possible for this to produce a version mismatch which needs to be resolved. If this is the case, please let us know about your build and runtime versions in any bug report.
A quick and easy way to test Spark with Mothra using this jar is to run the following commands, which will load the Mothra libraries (and display the available version):
$ spark-shell --jars mothra_2.12-1.7.0-full.jar
...
scala> org.cert.netsa.util.versionInfo("mothra")
res0: Option[String] = Some(1.7.0)
scala> org.cert.netsa.util.versionInfo.detail("mothra")
res1: Option[String] = Some(mothra (Scala 2.12.20))
Examine the version number to make sure you're using the correct version of Mothra. If it doesn't match the version you think you are testing, a previously installed version may be taking priority, and you should check your configuration.
If something goes wrong, the version details show you precisely which version of Scala was used for building and testing this jar file. In the example above, Scala 2.12.20 was used.
(Note that Scala versions with the same first two numbers should always be binary compatible, but versions that differ in the second number are not. That is: 2.13.5 and 2.13.16 are compatible; 2.12.1 and 2.12.20 are compatible; 2.12.x and 2.13.x are not.)
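If you're not sure which Scala version your Spark installation itself uses, you can check from within spark-shell; the value shown below is illustrative:
scala> val sparkScalaVersion = scala.util.Properties.versionNumberString
sparkScalaVersion: String = 2.13.8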
Next, define a tiny sample query to make sure that Spark can construct a query correctly:
scala> import org.cert.netsa.mothra.datasources._
import org.cert.netsa.mothra.datasources._
scala> val df = spark.read.ipfix("fccx-sample.ipfix")
df: org.apache.spark.sql.DataFrame = [startTime: timestamp, endTime: timestamp \
... 17 more fields]
(This specific data file is available in Sample Data, but any IPFIX file will do.)
If you are using a version of Spark built with a different version of Scala than Mothra (for example, you're using Scala 2.13, but the full jar was built for 2.12), you are very likely to encounter an error here, something like:
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at org.cert.netsa.mothra.datasources.fields.Field.<init>(Field.scala:6)
...
Next, you should try to actually run a query to make sure that data is decoded correctly:
scala> df.show(1, 0, true)
-RECORD 0-----------------------------------------
startTime | 2015-09-14 14:55:20.568
...
only showing top 1 row
scala> df.count
res2: Long = 428
If nothing up to this point has caused a failure but you are still seeing a problem, it's likely something deeper than a dependency issue. Either way, please report the results of your investigation to us in your bug report; we can use it to investigate the source of the problem. It would also be very helpful if you could include the output of the following command with your bug report:
$ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.4
/_/
Using Scala version 2.13.8, OpenJDK 64-Bit Server VM, 17.0.12
Branch HEAD
Compiled by user yangjie01 on 2024-12-17T04:17:18Z
Revision a6f220d951742f4074b37772485ee0ec7a774e7d
Url https://github.com/apache/spark
Type --help for more information.