TextParse

TextParse uses Julia's generated functions to build efficient, specialized parsers for text files. It minimizes allocations, which keeps garbage collector (GC) involvement low.

Installation

Pkg.add("TextParse")

Reading CSV

The most useful function is probably csvread, which reads a CSV file:

TextParse.csvread (function)
csvread(file::Union{String,IO}, delim=','; <arguments>...)

Read CSV from file. Returns a tuple of 2 elements:

  1. A tuple of columns, each either a Vector or a StringArray
  2. The column names if header_exists=true, an empty array otherwise

Arguments:

  • file: either an IO object or file name string
  • delim: the delimiter character
  • spacedelim: (Bool) parse space-delimited files. delim has no effect if true.
  • quotechar: character used to quote strings, defaults to "
  • escapechar: character used to escape quotechar in strings. (could be the same as quotechar)
  • commentchar: ignore lines that begin with commentchar
  • row_estimate: estimated number of rows in the file. Defaults to 0 in which case we try to estimate this.
  • skiplines_begin: skips specified number of lines at the beginning of the file
  • header_exists: boolean specifying whether CSV file contains a header
  • nastrings: strings that are to be considered NA. Defaults to TextParse.NA_STRINGS
  • colnames: manually specified column names. Could be a vector or a dictionary from Int index (the column) to String column name.
  • colparsers: parsers to use for specified columns. This can be a vector or a dictionary from column name or column index (Int) to a "parser". The simplest parser is a type such as Int or Float64. It can also be a dateformat"..."; see CustomParser if you want to plug in custom parsing behavior. If you pass nothing as the parser for a given column, that column is skipped.
  • type_detect_rows: number of rows used to infer the initial colparsers. Defaults to 20.
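
As a hedged illustration of the call described above, the sketch below reads a small in-memory CSV through an IOBuffer instead of a file on disk. The sample data and the variable names (data, columns, header, and so on) are made up for this example; only csvread, its positional delim argument, and the two-element return value come from the description above.

```julia
using TextParse

# Made-up sample data; csvread accepts an IO object as well as a file name.
data = """
id,price,label
1,2.5,apple
2,3.75,pear
"""

# Returns a tuple of columns and the column names read from the header row.
columns, header = csvread(IOBuffer(data))

# One entry per column, in file order.
ids, prices, labels = columns

# A different delimiter is passed as the second positional argument.
columns2, header2 = csvread(IOBuffer("x;y\n1;2\n3;4\n"), ';')
```

The column types are inferred from the first rows of the data, so ids and prices should come back as numeric vectors without any colparsers being specified.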

Some notable features of the CSV parser are:

Extensible parsing framework

TextParse operates by defining small parsers which are specialized to parse one kind of text very efficiently. Each such parser is described by a subtype of AbstractToken{T}. An AbstractToken{T} type should implement a tryparsenext method:

Dates.tryparsenext (function)

tryparsenext{T}(tok::AbstractToken{T}, str, i, till, localopts)

Parses the string str starting at position i and ending at or before position till. localopts is a LocalOpts object which contains contextual options for quoting and NA parsing. (see LocalOpts documentation)

tryparsenext returns a tuple (result, nextpos), where result is of type Nullable{T}: Nullable{T}() if parsing failed, or a non-null value containing the parsed value if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which parsing failed.
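
The (result, nextpos) contract can be sketched without TextParse itself. The function below is a hypothetical stand-in, not part of the library: parse_digits is an invented name, it omits the token and localopts arguments, and it uses nothing in place of a null Nullable{T} so that the sketch runs on its own.

```julia
# Stand-in for the tryparsenext contract: parse an unsigned integer from
# str, starting at index i and reading no further than index till, and
# return (result, nextpos).
# `nothing` plays the role of a null Nullable{Int} (an assumption of this sketch).
function parse_digits(str::AbstractString, i::Int, till::Int)
    value = 0
    pos = i
    while pos <= till && isdigit(str[pos])
        value = 10 * value + (str[pos] - '0')
        pos += 1    # ASCII input assumed, so indices advance by one
    end
    # No digit consumed: fail and report the position where parsing failed.
    pos == i && return (nothing, pos)
    # Success: report the parsed value and the position just past it.
    return (value, pos)
end

parse_digits("42,7", 1, 4)   # (42, 3): parsing stopped at the delimiter
parse_digits("x1", 1, 2)     # (nothing, 1): failed at the first character
```

Returning the position on failure as well as on success lets a caller report where a malformed field begins.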

Available AbstractToken types

Numeric{T}

Parses numbers of type T.

StringToken{T}

Parses a string to the AbstractString type T. If T is StrRange, returns a StrRange with the start position (offset) and length of the substring. It is used internally by csvparse to avoid allocating strings.

DateTimeToken(T, fmt::DateFormat)

Parse a date time string of format fmt into type T which is either Date, Time or DateTime.

NAToken(inner::AbstractToken; options...)

Parses a Nullable item.

Arguments

  • inner: the token to parse if non-null.
  • emptyisna: should an empty item be considered NA? defaults to true
  • nastrings: strings that are to be considered NA. Defaults to ["#N/A", "#N/A N/A", "#NA", "#n/a", "#n/a n/a", "#na", "-1.#IND", "-1.#QNAN", "-1.#ind", "-1.#qnan", "-NaN", "-nan", "-nan", "-nan", "1.#IND", "1.#QNAN", "1.#ind", "1.#qnan", "N/A", "N/A", "NA", "NA", "NULL", "NaN", "n/a", "n/a", "na", "na", "nan", "nan", "nan", "null"]

Quoted(inner::AbstractToken; <kwargs>...)

Arguments:

  • inner: The token inside quotes to parse
  • required: are quotes required for parsing to succeed? defaults to false
  • includequotes: include the quotes in the output. Defaults to false
  • includenewlines: include newlines that appear within quotes. Defaults to true
  • quotechar: character to use to quote (default decided by LocalOpts)
  • escapechar: character that escapes the quote char (default set by LocalOpts)
CustomParser(f, T)

Provide a custom parsing mechanism.

Arguments:

  • f: the parser function
  • T: The type of the parsed value

The parser function must take the following arguments:

  • str: the entire string being parsed
  • pos: the position in the string at which to start parsing
  • len: the length of the string, that is, the maximum position to parse up to
  • opts: a LocalOpts object with options local to the current field.

The parser function must return a tuple of two values:

  • result: A Nullable{T}. Set to Nullable{T}() if parsing must fail; otherwise it contains the parsed value.
  • nextpos: If parsing succeeded this must be the next position after parsing finished, if it failed this must be the position at which parsing failed.
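
A parser function with this four-argument shape might look like the following sketch. parse_yesno is a hypothetical name, the opts argument is accepted but ignored, and nothing stands in for a null Nullable{Bool} so that the code runs without TextParse; a function actually passed to CustomParser would return a Nullable{Bool} as described above.

```julia
# Hypothetical parser function with the (str, pos, len, opts) signature.
# It recognizes the literals "yes" and "no" and maps them to Bool.
# `nothing` stands in for a null Nullable{Bool} (an assumption of this sketch).
function parse_yesno(str, pos, len, opts)
    for (literal, value) in (("yes", true), ("no", false))
        stop = pos + length(literal) - 1
        # ASCII input assumed, so character offsets equal string indices.
        if stop <= len && str[pos:stop] == literal
            return (value, stop + 1)   # success: position just past the literal
        end
    end
    return (nothing, pos)              # failure: position where parsing failed
end

parse_yesno("yes,no", 1, 6, nothing)   # (true, 4)
parse_yesno("yes,no", 5, 6, nothing)   # (false, 7)
parse_yesno("maybe", 1, 5, nothing)    # (nothing, 1)
```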

LocalOpts

LocalOpts

Options local to the token currently being parsed.

  • endchar: the character at which parsing stops (e.g. the delimiter or the closing quote character)
  • spacedelim: Treat spaces as delimiters
  • quotechar: the quote character
  • escapechar: char that escapes the quote
  • includequotes: whether to include quotes while parsing
  • includenewlines: whether to include newlines while parsing