[ANN] Douglass.jl -- Stata-like interface to Julia DataFrames

Hi all,

I’ve been using Julia for model estimation in a couple of papers, but always found myself going back and forth between Stata (for data manipulation) and Julia (for accessing JuMP or the solvers), sometimes using very ugly hacks.

So I’ve decided to write a little package, called Douglass.jl, that implements a Stata-like syntax for basic data manipulation on Julia DataFrames. It parses the command and calls a macro that returns the corresponding DataFrames.jl or DataFramesMeta.jl code for the task. That means the generated code runs in the current scope, so you can use any functions or variables in the expressions (think gen myvariable = myfunction(x)). Besides that, you can use syntax that is very similar to Stata’s:

using Douglass, RDatasets
df = dataset("datasets", "iris")
# set the active DataFrame
Douglass.set_active_df(:df)

# create a variable `z` that is the sum of `SepalLength` and `SepalWidth`, for each row
d"gen :z = :SepalLength + :SepalWidth"
# replace `z` by the row index for the first 10 observations
d"replace :z = _n if _n <= 10"
# drop a variable
d"drop :z"
# construct the within-group mean for a subset of the observations
d"bysort :Species : egen :z = mean(:SepalLength) if :SepalWidth .> 3.0"

and so on.
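
To give a feel for what these commands correspond to, here is a rough sketch of similar operations written directly against DataFrames.jl. This is not the code Douglass actually generates; it is an illustrative hand-written equivalent. The small hand-made table is an assumption standing in for the iris data, and the row cutoff is 2 rather than 10 because the table only has six rows.

```julia
using DataFrames, Statistics

# Hypothetical stand-in for the iris data (made-up values, same column names)
df = DataFrame(Species     = ["a", "a", "a", "b", "b", "b"],
               SepalLength = [5.1, 4.9, 4.7, 6.3, 5.8, 7.1],
               SepalWidth  = [3.5, 3.0, 3.2, 3.3, 2.7, 3.0])

# gen :z = :SepalLength + :SepalWidth
df.z = df.SepalLength .+ df.SepalWidth

# replace :z = _n if _n <= 2
n = 1:nrow(df)            # plays the role of Stata's _n
mask = n .<= 2
df.z[mask] = n[mask]

# drop :z
select!(df, Not(:z))

# bysort :Species : egen :z = mean(:SepalLength) if :SepalWidth .> 3.0
# Rows that fail the condition keep a missing value, as in Stata.
df.z = Vector{Union{Missing, Float64}}(missing, nrow(df))
for g in groupby(df, :Species)
    keep = g.SepalWidth .> 3.0
    # g.z is a view into df.z, so this writes through to the parent table
    g.z[keep] .= mean(g.SepalLength[keep])
end
```

The group mean is computed only over the rows that satisfy the condition, which is how the Stata command with an `if` qualifier would behave.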

The package is still at a very early stage and hence not yet ready for use in research papers. I’m trying to get a sense of whether it would be useful to people other than me, and how much time I invest in it will depend on that. Please consider giving a ‘thumbs up’, a GitHub star etc. if you feel it could be useful. Of course I would also appreciate people trying it out and giving feedback. Please file bugs in the ‘issues’ tab on the GitHub repo, or post your thoughts below.

The package is not yet registered, so you have to install it with

] add https://github.com/jmboehm/Douglass.jl.git

Best, Johannes


Thanks @jmboehm, I think this is a very nice idea – useful to many economists. We’ll put out a tweet about it.


Johannes,
Amazing! Are you planning to connect up reg etc as well? And to FixedEffects.jl?

And make sure to convert all Julia missing values to some large numeric value so that the Julia code can properly replicate all that buggy Stata code that relies on comparisons (https://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/) :blush:
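
For readers who have not run into the difference: Stata treats a missing value as larger than any number in comparisons, whereas Julia's `missing` propagates through them. The following minimal sketch in plain Julia illustrates this; the vector `x` and the use of `Inf` to imitate Stata's ordering are assumptions made for the example, not anything Douglass does.

```julia
x = [2.5, missing, 3.5]

# Julia: comparing with missing yields missing (three-valued logic)
flags = x .> 3.0

# A common Julia idiom: treat an unknown comparison result as false
keep = coalesce.(flags, false)

# Imitating Stata's ordering, where missing exceeds every number
stata_like = coalesce.(x, Inf) .> 3.0
```

With the Stata-style ordering, the missing row passes the `> 3.0` filter, which is the behaviour the linked FAQ warns about.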

I wasn’t planning to, because (1) both have a nice and simple interface; (2) more dependencies make the package more costly to maintain. But if there is demand for such things, I’ll think of a good way to separate the parser and interface from the implementation of the commands so that people can write their own implementations of reg etc.

I’m still busy implementing non-seedable random row ordering in bysort :wink:
Ironically, it would be really easy to implement Stata’s different missing values in Julia, but… hell no :slight_smile:

I don’t know if there is. I was just curious what the scope of the project was.

I didn’t know about that beauty!