NAME

rwdedupe - Eliminate duplicate SiLK Flow records

SYNOPSIS

rwdedupe [--ignore-fields=FIELDS] [--packets-delta=NUM]
      [--bytes-delta=NUM] [--stime-delta=FLOAT]
      [--duration-delta=FLOAT]
      [--temp-directory=DIR_PATH] [--buffer-size=SIZE]
      [--note-add=TEXT] [--note-file-add=FILE]
      [--compression-method=COMP_METHOD] [--print-filenames]
      [--output-path=PATH] [--site-config-file=FILENAME]
      {[--xargs] | [--xargs=FILENAME] | [FILE [FILE ...]]}

rwdedupe --help

rwdedupe --help-fields

rwdedupe --version

DESCRIPTION

rwdedupe reads SiLK Flow records from one or more input sources. Records that appear in the input file(s) multiple times will only appear in the output stream once; that is, duplicate records are not written to the output. The SiLK Flows are written to the file specified by the --output-path switch or to the standard output when the --output-path switch is not provided and the standard output is not connected to a terminal.

Note: As part of its processing, rwdedupe re-orders the records before writing them.

rwdedupe reads SiLK Flow records from the files named on the command line or from the standard input when no file names are specified and --xargs is not present. To read the standard input in addition to the named files, use - or stdin as a file name. If an input file name ends in .gz, the file is uncompressed as it is read. When the --xargs switch is provided, rwdedupe reads the names of the files to process from the named text file or from the standard input if no file name argument is provided to the switch. The input to --xargs must contain one file name per line.

By default, rwdedupe will consider one record to be a duplicate of another when all the fields in the records match exactly. From another point of view, any difference between two records results in both records appearing in the output. Note that "all" means every field that exists on a SiLK Flow record. The complete list of fields is specified in the description of --ignore-fields in the "OPTIONS" section below.

To have rwdedupe ignore fields in the comparison, specify those fields in the --ignore-fields switch. When --ignore-fields=FIELDS is specified, a record is considered a duplicate of another if all fields except those in FIELDS match exactly. rwdedupe will treat FIELDS as being identical across all records. Put another way, if the only difference between two records is in the FIELDS fields, only one of those records will be written to the output.
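The effect of --ignore-fields can be modeled in a few lines of Python. This is only an illustrative sketch, not rwdedupe's implementation: records are represented as plain dictionaries, the function name dedupe is hypothetical, and keeping the first record seen is an assumption (the text above only states that one of the records is written). The sketch also preserves input order, whereas rwdedupe re-orders records.

```python
def dedupe(records, ignore_fields=()):
    """Return the records with duplicates removed (hypothetical model).

    Two records are duplicates when every field outside ignore_fields
    matches exactly. Assumption: the first record seen is the one kept.
    """
    ignored = set(ignore_fields)
    seen = set()
    unique = []
    for rec in records:
        # Build a hashable key from the fields that still matter.
        key = tuple(sorted(
            (name, value) for name, value in rec.items()
            if name not in ignored
        ))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Two flows that differ only in the sensor that collected them.
flows = [
    {"sIP": "10.0.0.1", "dIP": "10.0.0.2", "bytes": 120, "sensor": "S1"},
    {"sIP": "10.0.0.1", "dIP": "10.0.0.2", "bytes": 120, "sensor": "S2"},
]
kept_default = dedupe(flows)
kept_ignoring_sensor = dedupe(flows, ignore_fields=["sensor"])
```

With no ignored fields both flows are kept; ignoring sensor leaves a single record.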

The --packets-delta, --bytes-delta, --stime-delta, and --duration-delta switches allow for "fuzziness" in the input. For example, if --stime-delta=FLOAT is specified, the only difference between two records is in the sTime fields, and those fields are within FLOAT milliseconds of each other, only one record will be written to the output.

As of SiLK 3.23, the --stime-delta and --duration-delta switches accept a floating point number to allow for sub-millisecond differences, reflecting the nanosecond resolution added in that release. The argument is still specified in terms of milliseconds: use --stime-delta=5000 for 5 seconds, --stime-delta=5 for 5 milliseconds, and --stime-delta=0.005 for 5 microseconds.
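The unit arithmetic can be checked with a small Python sketch. The function names are hypothetical, and representing times as integer nanoseconds is an assumption made for illustration; the sketch shows only the "differ by the delta or less" rule from the text, not how rwdedupe stores or compares times internally.

```python
from decimal import Decimal

NANOS_PER_MILLI = 1_000_000


def delta_to_nanos(text):
    """Convert a delta typed in milliseconds (e.g. "0.005") to nanoseconds."""
    # Decimal avoids binary floating point surprises for values like 0.005.
    return int(Decimal(text) * NANOS_PER_MILLI)


def within_delta(a_nanos, b_nanos, delta_text):
    """True when two times differ by the given delta or less."""
    return abs(a_nanos - b_nanos) <= delta_to_nanos(delta_text)


five_seconds = delta_to_nanos("5000")     # --stime-delta=5000
five_millis = delta_to_nanos("5")         # --stime-delta=5
five_micros = delta_to_nanos("0.005")     # --stime-delta=0.005
```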

During its processing, rwdedupe will try to allocate a large (near 2GB) in-memory array to hold the records. (You may use the --buffer-size switch to change this maximum buffer size.) If more records are read than will fit into memory, the in-core records are temporarily stored on disk as described by the --temp-directory switch. When all records have been read, the on-disk files are merged to produce the output.
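The buffer-then-merge approach described above is a form of external merge sort. The following Python sketch illustrates the general technique under simplifying assumptions: records are integers, max_in_memory counts records rather than bytes, and run files are plain text. The name dedupe_external is hypothetical, and none of this reflects rwdedupe's actual file format or code.

```python
import heapq
import os
import tempfile


def dedupe_external(records, max_in_memory, temp_dir=None):
    """Yield each distinct integer record once, in sorted order."""
    run_paths = []
    chunk = []

    def spill():
        # Sort the in-memory chunk and write it to disk as one run file.
        chunk.sort()
        fd, path = tempfile.mkstemp(dir=temp_dir, text=True)
        with os.fdopen(fd, "w") as out:
            out.writelines(f"{rec}\n" for rec in chunk)
        run_paths.append(path)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_in_memory:
            spill()
    if chunk:
        spill()

    files = [open(path) for path in run_paths]
    try:
        # Merge the sorted runs; equal records become adjacent,
        # so comparing with the previous record removes duplicates.
        runs = [(int(line) for line in f) for f in files]
        previous = None
        for rec in heapq.merge(*runs):
            if rec != previous:
                yield rec
                previous = rec
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)


result = list(dedupe_external([5, 3, 5, 1, 3, 9, 1], max_in_memory=3))
```

Because the runs are sorted before merging, the output order differs from the input order, which matches the note above that rwdedupe re-orders records.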

By default, the temporary files are stored in the /tmp directory. Because of the sizes of the temporary files, it is strongly recommended that /tmp not be used as the temporary directory, and rwdedupe will print a warning when /tmp is used. To modify the temporary directory used by rwdedupe, provide the --temp-directory switch, set the SILK_TMPDIR environment variable, or set the TMPDIR environment variable.
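The precedence among the switch, the two environment variables, and the default can be summarized in a short Python sketch. The function name is hypothetical, and treating an empty value the same as an unset one is an assumption not stated in this page.

```python
import os


def choose_temp_dir(switch_value=None, environ=None):
    """Pick the temporary directory using the documented precedence.

    --temp-directory, then SILK_TMPDIR, then TMPDIR, then /tmp.
    Assumption: empty strings are treated as "not set".
    """
    environ = os.environ if environ is None else environ
    return (
        switch_value
        or environ.get("SILK_TMPDIR")
        or environ.get("TMPDIR")
        or "/tmp"
    )


env = {"SILK_TMPDIR": "/var/silk/tmp", "TMPDIR": "/scratch"}
from_switch = choose_temp_dir("/data/tmp", env)
from_silk_tmpdir = choose_temp_dir(None, env)
from_tmpdir = choose_temp_dir(None, {"TMPDIR": "/scratch"})
fallback = choose_temp_dir(None, {})
```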

OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.

--ignore-fields=FIELDS

Ignore the fields listed in FIELDS when determining if two flow records are identical; that is, treat FIELDS as being identical across all flows. By default, all fields are treated as significant.

FIELDS is a comma separated list of field-names, field-integers, and ranges of field-integers; a range is specified by separating the start and end of the range with a hyphen (-). Field-names are case-insensitive. Example:

--ignore-fields=stime,12-15

The list of supported fields is:

sIP,1

source IP address

dIP,2

destination IP address

sPort,3

source port for TCP and UDP, or equivalent

dPort,4

destination port for TCP and UDP, or equivalent

protocol,5

IP protocol

packets,pkts,6

packet count

bytes,7

byte count

flags,8

bit-wise OR of TCP flags over all packets

sTime,9

starting time of flow (microseconds resolution)

duration,10

duration of flow (microseconds resolution)

sensor,12

name or ID of sensor at the collection point

in,13

router SNMP input interface or vlanId if packing tools were configured to capture it (see sensor.conf(5))

out,14

router SNMP output interface or postVlanId

nhIP,15

router next hop IP

class,20,type,21

class and type of sensor at the collection point (represented internally by a single value)

initialFlags,26

TCP flags on first packet in the flow

sessionFlags,27

bit-wise OR of TCP flags over all packets except the first in the flow

attributes,28

flow attributes set by flow generator

application,29

guess as to the content of the flow. Some software that generates flow records from packet data, such as yaf(1), will inspect the contents of the packets that make up a flow and use traffic signatures to label the content of the flow. SiLK calls this label the application; yaf refers to it as the appLabel. The application is the port number that is traditionally used for that type of traffic (see the /etc/services file on most UNIX systems). For example, traffic that the flow generator recognizes as FTP will have a value of 21, even if that traffic is being routed through the standard HTTP/web port (80).
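The FIELDS syntax accepted by --ignore-fields (names, integers, and hyphenated integer ranges) can be illustrated with a small Python parser. The name-to-integer table is copied from the field list above; the function name parse_fields is hypothetical, and raising an error for an integer that is not in the list is an assumption about error handling rather than documented behavior.

```python
# Field names (lower-cased, since names are case-insensitive) and their integers.
FIELD_IDS = {
    "sip": 1, "dip": 2, "sport": 3, "dport": 4, "protocol": 5,
    "packets": 6, "pkts": 6, "bytes": 7, "flags": 8, "stime": 9,
    "duration": 10, "sensor": 12, "in": 13, "out": 14, "nhip": 15,
    "class": 20, "type": 21, "initialflags": 26, "sessionflags": 27,
    "attributes": 28, "application": 29,
}
SUPPORTED = set(FIELD_IDS.values())


def parse_fields(text):
    """Return the set of field integers named by a FIELDS string."""
    selected = set()
    for token in text.split(","):
        token = token.strip().lower()
        if token in FIELD_IDS:
            ids = [FIELD_IDS[token]]
        elif "-" in token:
            # A range such as 12-15 includes both end points.
            start, end = (int(part) for part in token.split("-", 1))
            ids = list(range(start, end + 1))
        else:
            ids = [int(token)]
        for field_id in ids:
            if field_id not in SUPPORTED:
                # Assumption: unknown integers are rejected.
                raise ValueError(f"unsupported field: {field_id}")
            selected.add(field_id)
    return selected


ignored = parse_fields("stime,12-15")
```

For the example --ignore-fields=stime,12-15, the sketch selects sTime, sensor, in, out, and nhIP.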

--packets-delta=NUM

Treat the packets field on two records as being the same if the values differ by NUM packets or less. If not specified, the default is 0.

--bytes-delta=NUM

Treat the bytes field on two records as being the same if the values differ by NUM bytes or less. If not specified, the default is 0.

--stime-delta=FLOAT

Treat the start-time field on two records as being the same if the values differ by FLOAT milliseconds or less. As of SiLK 3.23, the argument may be a floating point number to support sub-millisecond differences. If not specified, the default is 0.

--duration-delta=FLOAT

Treat the duration field on two records as being the same if the values differ by FLOAT milliseconds or less. As of SiLK 3.23, the argument may be a floating point number to support sub-millisecond differences. If not specified, the default is 0.

--temp-directory=DIR_PATH

Specify the name of the directory in which to store data files temporarily when more records have been read than will fit into RAM. This switch overrides the directory specified in the SILK_TMPDIR environment variable, which overrides the directory specified in the TMPDIR variable, which overrides the default, /tmp.

--buffer-size=SIZE

Set the maximum size of the buffer to use for holding the records, in bytes. A larger buffer means fewer temporary files need to be created, reducing the I/O wait times. The default maximum for this buffer is near 2GB. The SIZE may be given as an ordinary integer, or as a real number followed by a suffix K, M or G, which represents the numerical value multiplied by 1,024 (kilo), 1,048,576 (mega), and 1,073,741,824 (giga), respectively. For example, 1.5K represents 1,536 bytes, or one and one-half kilobytes. (This value does not represent the absolute maximum amount of RAM that rwdedupe will allocate, since additional buffers will be allocated for reading the input and writing the output.)
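The SIZE arithmetic can be reproduced with a short Python sketch. The function name parse_size is hypothetical, and the sketch accepts only the upper-case suffixes K, M, and G as written above; whether rwdedupe accepts other spellings is not covered here.

```python
MULTIPLIERS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a SIZE string such as "1.5K" or "4096" to a byte count."""
    suffix = text[-1:]
    if suffix in MULTIPLIERS:
        # A real number followed by a suffix is scaled by a power of 1,024.
        return int(float(text[:-1]) * MULTIPLIERS[suffix])
    # Without a suffix the value must be an ordinary integer.
    return int(text)


one_and_a_half_k = parse_size("1.5K")
half_gig = parse_size("0.5G")
plain = parse_size("4096")
```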

--output-path=PATH

Write the binary SiLK Flow records to PATH, where PATH is a filename, a named pipe, the keyword stderr to write the output to the standard error, or the keyword stdout or - to write the output to the standard output. If PATH names an existing file, rwdedupe exits with an error unless the SILK_CLOBBER environment variable is set, in which case PATH is overwritten. If this switch is not given, the output is written to the standard output. Attempting to write the binary output to a terminal causes rwdedupe to exit with an error.

--note-add=TEXT

Add the specified TEXT to the header of the output file as an annotation. This switch may be repeated to add multiple annotations to a file. To view the annotations, use the rwfileinfo(1) tool.

--note-file-add=FILENAME

Open FILENAME and add the contents of that file to the header of the output file as an annotation. This switch may be repeated to add multiple annotations. Currently the application makes no effort to ensure that FILENAME contains text; be careful that you do not attempt to add a SiLK data file as an annotation.

--compression-method=COMP_METHOD

Specify the compression library to use when writing output files. If this switch is not given, the value in the SILK_COMPRESSION_METHOD environment variable is used if the value names an available compression method. When no compression method is specified, output to the standard output or to named pipes is not compressed, and output to files is compressed using the default chosen when SiLK was compiled. The valid values for COMP_METHOD are determined by which external libraries were found when SiLK was compiled. To see the available compression methods and the default method, use the --help or --version switch. SiLK can support the following COMP_METHOD values when the required libraries are available.

none

Do not compress the output using an external library.

zlib

Use the zlib(3) library for compressing the output, and always compress the output regardless of the destination. Using zlib produces the smallest output files at the cost of speed.

lzo1x

Use the lzo1x algorithm from the LZO real time compression library for compression, and always compress the output regardless of the destination. This compression provides good compression with less memory and CPU overhead.

snappy

Use the snappy library for compression, and always compress the output regardless of the destination. This compression provides good compression with less memory and CPU overhead. Since SiLK 3.13.0.

best

Use lzo1x if available, otherwise use snappy if available, otherwise use zlib if available. Only compress the output when writing to a file.
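The preference order behind the best value can be written as a short Python sketch. The function name is hypothetical, and returning none when no library is available is an assumption; the text above does not say what happens in that case.

```python
BEST_PREFERENCE = ("lzo1x", "snappy", "zlib")


def resolve_best(available):
    """Return the method that "best" selects from the available ones."""
    for method in BEST_PREFERENCE:
        if method in available:
            return method
    # Assumption: fall back to no compression when nothing is available.
    return "none"


with_everything = resolve_best({"zlib", "snappy", "lzo1x"})
without_lzo = resolve_best({"zlib", "snappy"})
zlib_only = resolve_best({"zlib"})
```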

--print-filenames

Print to the standard error the names of input files as they are opened.

--site-config-file=FILENAME

Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, rwdedupe searches for the site configuration file in the locations specified in the "FILES" section.

--xargs
--xargs=FILENAME

Read the names of the input files from FILENAME or from the standard input if FILENAME is not provided. The input is expected to have one filename per line. rwdedupe opens each named file in turn and reads records from it as if the filenames had been listed on the command line.

--help

Print the available options and exit.

--help-fields

Print the description and alias(es) of each field and exit.

--version

Print the version number and information about how SiLK was configured, then exit the application.

LIMITATIONS

When the temporary files and the final output are stored on the same file volume, rwdedupe will require approximately twice as much free disk space as the size of input data.

When the temporary files and the final output are on different volumes, rwdedupe will require between 1 and 1.5 times as much free space on the temporary volume as the size of the input data.
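As a rough planning aid, the two rules above can be turned into a small Python calculation. The function name is hypothetical and the figures are only the approximate multiples stated in this section, not guarantees.

```python
def free_space_estimate(input_bytes, same_volume):
    """Return a (low, high) estimate of free space needed, in bytes.

    same_volume=True: temporary files and output share a volume (about 2x).
    same_volume=False: estimate for the temporary volume only (1x to 1.5x).
    """
    if same_volume:
        needed = 2 * input_bytes
        return (needed, needed)
    return (input_bytes, input_bytes * 3 // 2)


ten_gib = 10 * 1024 ** 3
shared = free_space_estimate(ten_gib, same_volume=True)
separate = free_space_estimate(ten_gib, same_volume=False)
```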

EXAMPLE

In the following examples, the dollar sign ($) represents the shell prompt. The text after the dollar sign represents the command line.

Suppose you have made several rwfilter(1) runs to find interesting traffic:

$ rwfilter --start-date=2008/02/04 ... --pass=data1.rw
$ rwfilter --start-date=2008/02/04 ... --pass=data2.rw
$ rwfilter --start-date=2008/02/04 ... --pass=data3.rw
$ rwfilter --start-date=2008/02/04 ... --pass=data4.rw

You now want to merge that traffic into a single output file, but you want to ensure that any records appearing in multiple output files are only counted once. You can use rwdedupe to merge the output files to a single file, data.rw:

$ rwdedupe data1.rw data2.rw data3.rw data4.rw --output-path=data.rw

ENVIRONMENT

SILK_TMPDIR

When set and --temp-directory is not specified, rwdedupe writes the temporary files it creates to this directory. SILK_TMPDIR overrides the value of TMPDIR.

TMPDIR

When set and SILK_TMPDIR is not set, rwdedupe writes the temporary files it creates to this directory.

SILK_CLOBBER

The SiLK tools normally refuse to overwrite existing files. Setting SILK_CLOBBER to a non-empty value removes this restriction.

SILK_COMPRESSION_METHOD

This environment variable is used as the value for --compression-method when that switch is not provided. Since SiLK 3.13.0.

SILK_CONFIG_FILE

This environment variable is used as the value for the --site-config-file when that switch is not provided.

SILK_DATA_ROOTDIR

This environment variable specifies the root directory of the data repository. As described in the "FILES" section, rwdedupe may use this environment variable when searching for the SiLK site configuration file.

SILK_PATH

This environment variable gives the root of the install tree. When searching for configuration files, rwdedupe may use this environment variable. See the "FILES" section for details.

SILK_TEMPFILE_DEBUG

When set to 1, rwdedupe prints debugging messages to the standard error as it creates, re-opens, and removes temporary files.

FILES

${SILK_CONFIG_FILE}
${SILK_DATA_ROOTDIR}/silk.conf
/data/silk.conf
${SILK_PATH}/share/silk/silk.conf
${SILK_PATH}/share/silk.conf
/usr/share/silk/silk.conf
/usr/share/silk.conf

Possible locations for the SiLK site configuration file which are checked when the --site-config-file switch is not provided.
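The list of candidate locations can be generated with a short Python sketch. The function name is hypothetical, and two points are assumptions: that the locations are checked in the order listed, and that entries depending on an unset environment variable are skipped.

```python
def site_config_candidates(environ):
    """Return candidate silk.conf paths in the order listed above."""
    candidates = []
    if environ.get("SILK_CONFIG_FILE"):
        candidates.append(environ["SILK_CONFIG_FILE"])
    if environ.get("SILK_DATA_ROOTDIR"):
        candidates.append(f"{environ['SILK_DATA_ROOTDIR']}/silk.conf")
    candidates.append("/data/silk.conf")
    if environ.get("SILK_PATH"):
        root = environ["SILK_PATH"]
        candidates.append(f"{root}/share/silk/silk.conf")
        candidates.append(f"{root}/share/silk.conf")
    candidates.extend(["/usr/share/silk/silk.conf", "/usr/share/silk.conf"])
    return candidates


defaults_only = site_config_candidates({})
with_silk_path = site_config_candidates({"SILK_PATH": "/opt/silk"})
```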

${SILK_TMPDIR}/
${TMPDIR}/
/tmp/

Directory in which to create temporary files.

SEE ALSO

rwfilter(1), rwfileinfo(1), sensor.conf(5), silk(7), yaf(1), zlib(3)