#MonthOfJulia Day 14: Data
The DataFrame type in Julia is not dissimilar to the analogous types in R and Python/pandas. It provides a way of grouping data which is convenient for analysis and reminiscent of a database table.
Next we can start assembling our data. A DataFrame can be built up one field at a time (as is done in the example below) or by passing all of the data at once to the constructor.
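To make the two approaches concrete, here is a minimal sketch. It assumes the DataFrames release that was current when this post was written (columns indexed by symbol); the people data frame and its values are made up for illustration.

```julia
using DataFrames   # assumes the 2015-era DataFrames API used in this post

# Build a data frame one column at a time (hypothetical sample data).
people = DataFrame()
people[:name] = ["Andrew", "Claire", "Bob"]
people[:age]  = [43, 35, 27]

# Or pass all of the columns to the constructor at once.
people2 = DataFrame(name = ["Andrew", "Claire", "Bob"], age = [43, 35, 27])
```

Both routes should give a table with three rows and two columns.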
names() and eltypes() provide a high-level overview of the data, giving the names and data types respectively for each column.
You can dig deeper with describe(), which gives a simple statistical summary of each column. It does essentially the same thing as summary() in R.
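A short sketch of these inspection functions, again assuming the DataFrames release of the time (eltypes() in particular does not exist in later versions). The people data frame is hypothetical.

```julia
using DataFrames   # assumes the 2015-era DataFrames API used in this post

# Hypothetical sample data.
people = DataFrame(name = ["Andrew", "Claire", "Bob"], age = [43, 35, 27])

names(people)      # column names, returned as symbols
eltypes(people)    # element type of each column
describe(people)   # prints a simple statistical summary of each column
```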
Indexing operations allow you to access the data in various ways. There are also head() and tail(), which return the first and last few records in the data.
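Some of the indexing forms might look like the sketch below. It assumes the symbol-based indexing and the head() and tail() functions of the DataFrames release of the time; the data are hypothetical.

```julia
using DataFrames   # assumes the 2015-era DataFrames API used in this post

# Hypothetical sample data.
people = DataFrame(name = ["Andrew", "Claire", "Bob"], age = [43, 35, 27])

people[:age]        # a whole column
people[2, :age]     # a single cell: row 2 of the age column
people[1:2, :]      # the first two rows, all columns

head(people, 2)     # first two records
tail(people, 2)     # last two records
```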
You can apply a range of operations to columns. Note, however, that there is a subtle difference in syntax: while == is the normal equality operator, .== is the element-wise equality operator, which must be applied to columns in order to make element-by-element comparisons. A similar dotted syntax pertains to the other comparison operators.
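The difference between the two operators is easiest to see on a small vector. In this sketch a plain vector with made-up values stands in for a data frame column, so it needs no packages.

```julia
ages = [43, 35, 27]      # hypothetical values standing in for a column

ages == 35               # false: compares the vector as a whole with a number
ages .== 35              # element-wise: [false, true, false]
ages .> 30               # other dotted comparisons work the same way

# The element-wise result can be used as a mask to pick out matching entries.
ages[ages .== 35]        # [35]
```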
Of course you’re not likely to construct any serious collection of data manually. It’s more likely to come from a database or file. There are various ways to accomplish this. The simplest is reading from a delimited file.
After reading the file, names!() can be used to alter the column names. There are other ways of loading data from a delimited text file that will handle column names more elegantly. We’ll get to those in a few days’ time.
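As an illustration, the sketch below writes a small headerless file and reads it back. It assumes the readtable() and names!() functions from the DataFrames release of the time (both were dropped in later releases); the file name and contents are hypothetical.

```julia
using DataFrames   # assumes the 2015-era API (readtable, names!)

# Write a tiny headerless file so the example is self-contained.
path = joinpath(tempdir(), "people.csv")     # hypothetical file name
open(path, "w") do io
    write(io, "Andrew,43\nClaire,35\nBob,27\n")
end

# Without a header row the columns receive default names ...
people = readtable(path, header = false)

# ... which we then replace with meaningful ones.
names!(people, [:name, :age])
```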
Data are seldom perfect and missing values are not uncommon. Now, you might use a particular numerical value (like -9999, for example) to indicate a missing datum. However, this is a bit of a kludge, difficult to maintain and open to ambiguity. The DataArrays package introduces the singleton NA type which can be used to unambiguously indicate missing data.
A vector with missing data is created using the @data macro. The functions anyna() and allna() can be used to test whether any or all of the elements of a vector are missing.
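A minimal sketch of creating and testing a vector with a missing entry, assuming the DataArrays release of the time (which exports @data, NA, isna(), anyna() and allna()). The values are made up.

```julia
using DataArrays   # assumes the 2015-era DataArrays API

x = @data([1, 2, NA, 4])   # hypothetical values, one of them missing

isna(x)      # element-wise test for missing entries
anyna(x)     # true: at least one element is missing
allna(x)     # false: not every element is missing
```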
Two ways of dealing with NAs are to either drop them or replace them with another value.
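Both options can be sketched as follows, again assuming the DataArrays release of the time. The replacement here is done with a plain comprehension rather than any dedicated library function, and the fill value of 0 is an arbitrary choice for illustration.

```julia
using DataArrays   # assumes the 2015-era DataArrays API

x = @data([1, 2, NA, 4])   # hypothetical values, one of them missing

# Option 1: drop the missing entries.
dropped = dropna(x)                            # [1, 2, 4]

# Option 2: substitute a value of your choosing for each NA.
filled = [isna(v) ? 0 : v for v in x]          # [1, 2, 0, 4]
```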
Data frames have support for NAs already baked in.
For example, applying dropna() to a column before taking its mean gives the mean of the non-missing data.
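Here is a small sketch of a data frame with a missing entry. It assumes the DataFrames release of the time, which re-exports the DataArrays names (@data, NA, dropna()) and runs on a Julia version where mean() is available without further imports. The data are hypothetical.

```julia
using DataFrames   # assumes the 2015-era API, which re-exports DataArrays

# Hypothetical sample data with one missing age.
people = DataFrame(name = ["Andrew", "Claire", "Bob"],
                   age  = @data([43, NA, 27]))

# Drop the NA before averaging, so only the known ages contribute.
mean(dropna(people[:age]))   # 35.0
```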
Metaprogramming with a DataFrame
The DataFramesMeta package provides a handful of macros for applying metaprogramming techniques to data frames. For example:
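The sketch below shows a few of these macros. It is an illustration rather than the original listing, and it assumes the DataFramesMeta and DataFrames releases of the time, where column names are written as symbols inside @with, @where, @select and @transform. The data and the new months column are hypothetical.

```julia
using DataFrames, DataFramesMeta   # assumes the 2015-era releases of both

# Hypothetical sample data.
people = DataFrame(name = ["Andrew", "Claire", "Bob"], age = [43, 35, 27])

# Evaluate an expression with the columns referred to by symbol.
@with(people, mean(:age))                  # 35.0

# Keep only the rows that satisfy a condition.
@where(people, :age .> 30)

# Pick out columns, or add a derived one.
@select(people, :name)
@transform(people, months = :age * 12)
```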
Further examples can be found on the GitHub page for MonthOfJulia.