The DataFrame type in Julia is similar to the analogous types in R and Python/pandas. It groups data into named columns, which is convenient for analysis and reminiscent of a database table.
I’m assuming that you’ve already installed the DataFrames package. If not, take a look at yesterday’s post. The first step is then to load the package with using DataFrames.
Next we can start assembling our data. A DataFrame can be built up one field at a time (as is done in the example below) or by passing all of the data at once to the constructor.
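As a minimal sketch of both approaches, assuming the DataFrames release that was current when this post was written (newer releases have changed the column-assignment syntax), and using made-up names and ages purely for illustration:

```julia
using DataFrames

# Build the frame one column at a time (the data are invented for this sketch).
people = DataFrame()
people[:name] = ["Andrew", "Claire", "Bob", "Diane"]
people[:age] = [43, 35, 27, 51]

# Alternatively, hand every column to the constructor in one call.
people2 = DataFrame(name = ["Andrew", "Claire", "Bob", "Diane"],
                    age = [43, 35, 27, 51])

println(size(people))   # (rows, columns)
println(size(people2))
```

Either route should give a frame with four rows and two columns.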
names() and eltypes() provide a high-level overview of the data, giving the names and data types respectively for each column.
You can dig deeper with describe(), which gives a simple statistical summary of each column. It does essentially the same thing as summary() in R.
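The three overview functions can be tried on a small frame. This is only a sketch: the people frame and its contents are invented here, and eltypes() is assumed to be available as it was in the DataFrames release of the time.

```julia
using DataFrames

# A small, made-up data frame to inspect.
people = DataFrame(name = ["Andrew", "Claire", "Bob", "Diane"],
                   age = [43, 35, 27, 51])

println(names(people))     # column names, as symbols
println(eltypes(people))   # element type of each column

# Print a per-column statistical summary.
describe(people)
```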
Indexing operations allow you to access the data in various ways. There are also head() and tail(), which return the first and last few records in the data.
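A few common indexing patterns are sketched here on an invented people frame. The symbol-based column indexing shown is the style of the DataFrames release assumed throughout this post, so treat it as illustrative rather than definitive.

```julia
using DataFrames

# Made-up data for illustration.
people = DataFrame(name = ["Andrew", "Claire", "Bob", "Diane"],
                   age = [43, 35, 27, 51])

ages = people[:age]            # a whole column, selected by name
claire_age = people[2, :age]   # a single cell: row 2 of the age column
first_two = people[1:2, :]     # a range of rows, all columns

println(head(people, 2))       # first two records
println(tail(people, 2))       # last two records
```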
You can apply a range of operations to columns. Note, however, that there is a subtle difference in syntax: while == is the normal equality operator, .== is the element-wise equality operator which must be applied to columns in order to make element-by-element comparisons. A similar syntax pertains to other operators like .<= and .>.
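To make the distinction concrete, here is a small sketch on invented data, again assuming the DataFrames release of the time. Each element-wise comparison yields one Boolean per row, and the result can be used to pick out matching rows.

```julia
using DataFrames

# Made-up data for illustration.
people = DataFrame(name = ["Andrew", "Claire", "Bob", "Diane"],
                   age = [43, 35, 27, 51])

# Element-wise comparisons: one true/false value per row.
is_35 = people[:age] .== 35
over_forty = people[:age] .> 40

# Use the Boolean result to select the matching rows.
older = people[over_forty, :]
println(older)
```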
Of course you’re not likely to construct any serious collection of data manually. It’s more likely to come from a database or file. There are various ways to accomplish this. The simplest is reading from a delimited file.
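The following sketch writes a tiny headerless CSV file to a temporary location and reads it back. The file contents and the column names id and score are invented, and readtable() with its header and separator keywords is assumed to behave as it did in the DataFrames release of the time.

```julia
using DataFrames

# Write a small comma-delimited file with no header row (made-up values).
path = tempname() * ".csv"
open(path, "w") do io
    write(io, "1,2.5\n2,3.5\n3,4.5\n")
end

# Read it back; without a header the columns receive default names.
ratings = readtable(path, header = false, separator = ',')

# Replace the default names with meaningful ones.
names!(ratings, [:id, :score])
println(ratings)
```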
Note how names!() was used to alter the column names. There are other ways of loading data from a delimited text file that will handle column names more elegantly. We’ll get to those in a few days’ time.
Data are seldom perfect and missing values are not uncommon. Now, you might use a particular numerical value (like -9999, for example) to indicate a missing datum. However, this is a bit of a kludge: it is difficult to maintain and open to ambiguity. The DataArrays package introduces the singleton NA type, which can be used to unambiguously indicate missing data.
A vector with missing data is created using the @data macro.
Functions anyna() and allna() can be used to test whether any or all of the elements of a vector are missing.
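A short sketch of creating such vectors and testing them, with invented values and assuming the DataArrays release of the time:

```julia
using DataArrays

# A vector with two missing entries, and one that is entirely missing.
x = @data([1, 2, NA, 4, NA])
y = @data([NA, NA, NA])

println(isna(x))    # which elements of x are missing
println(anyna(x))   # is at least one element missing?
println(allna(x))   # are all elements missing?
println(allna(y))
```

Here x has some missing values but not all, while y consists only of NAs.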
Two ways of dealing with NAs are to either drop them or replace them with another value.
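Both options are sketched here on an invented vector. dropna() is the function named in this post; the replacement step uses plain indexed assignment with a stand-in value of 0, which is just one possible way to do it and is my assumption rather than necessarily the approach originally shown.

```julia
using DataArrays

x = @data([1, 2, NA, 4, NA])

# Option 1: drop the missing entries altogether.
kept = dropna(x)

# Option 2: work on a copy and overwrite each NA with a stand-in value.
filled = copy(x)
filled[isna(filled)] = 0

println(kept)
println(filled)
```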
Data frames have support for NAs already baked in.
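As an illustration, here is a small frame with one missing score. The column names and values are invented, and the API is assumed to be that of the DataFrames and DataArrays releases of the time.

```julia
using DataFrames, DataArrays

# The second score is missing (made-up data).
scores = DataFrame(id = [1, 2, 3, 4],
                   score = @data([7.5, NA, 9.0, 6.0]))

println(scores)

# Average only the scores that are actually present.
average = mean(dropna(scores[:score]))
println(average)
```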
Note how dropna() was used to calculate the mean of the non-missing data.
Metaprogramming with a DataFrame
The DataFramesMeta package provides a handful of macros for applying metaprogramming techniques to data frames. For example:
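A minimal sketch using three of those macros on invented data. It assumes the early DataFramesMeta release that matched the DataFrames version used in this post, so the exact macro behaviour may differ in later releases.

```julia
using DataFrames, DataFramesMeta

# Made-up data for illustration.
people = DataFrame(name = ["Andrew", "Claire", "Bob", "Diane"],
                   age = [43, 35, 27, 51])

# Inside the macros, :age and :name refer to columns of the frame.
older = @where(people, :age .> 40)       # keep rows matching a condition
names_only = @select(people, :name)      # keep a subset of columns
mean_age = @with(people, mean(:age))     # evaluate an expression over columns

println(older)
println(names_only)
println(mean_age)
```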
Further examples can be found on the GitHub page for MonthOfJulia.