Posts Tagged ‘data’

Running sums in Stata

Posted in code on February 21st, 2010 by Michael Ewens – 2 Comments

Perhaps it is bad that I didn’t know this before, but the following code for Stata would have saved a week off of my dissertation work.  Suppose that you have data structured like so:

firm_id,date,amount

and you want to create a new variable that is the total amount as of each date for each firm.  In Stata, you simply type:

sort firm_id date
bysort firm_id: gen total_t = sum(amount)

Note the use of ‘gen‘ rather than ‘egen.’ The ‘sum’ command differs by the type of generate command (i.e. gen or egen), so about 500 lines of loops written in Stata code could be condensed in a few lines.  Stata needs to fix the ‘egen’ and ‘gen’ distinction or I need to port more of my projects to R.

Create a lagged sum variable in Stata

Posted in code on May 19th, 2009 by Michael Ewens – 1 Comment

UPDATE: this code is stupid. You can do this in a few lines.

Although Stata has powerful lag operators for both time series and panel settings, it is difficult to construct lagged sums of variables across an unbalanced panel.  Suppose you have T_i observations for i=1,\ldots,N individuals.   Each time period you observe some value like investments, hours worked, etc.   The example I discuss below simply counts the number of times an event has occurred up to time t for each observation.  Here an event is “IPO” where I need the total number of IPOs prior to $latex  t$.  First, I  construct an indicator for the event and generate a simple date variable that ignores the unbalanced nature of the dataset:

gen ipo = (outcome == "IPO")
sort panelid datevar
bysort panelid: gen date = _n
sum date
local total = r(max)
keep panelid date ipo

I save this as a temporary dataset for the unfortunate for loop:

save temp, replace

and reshape for lagged summing:

reshape wide ipo, i(panelid) j(date)

The following for loop exploits the wide nature of the reshaped data and computes a sum of the IPO variable as of date t:

forvalues x = 2(1)`total'{
if(`x'==2){
gen total_ipo2 = ipo1
}
else {
local minx = `x'-1
gen total_ipo`x' = total_ipo`minx' + ipo`minx'
}
}

Now have to reshape back

drop ipo1-ipo`total'
reshape long total_ipo, i(panelid) j(date)
drop if total_ipo == .
sort panelid date

The sums in the for loop above generate one too many observations at the end of each observations time series that we need to drop:

bysort panelid: gen n = _n
bysort panelid: gen nn = _N
drop if n == nn
drop n nn
keep panelid date total_*
sort panelid date

And it is done.  I am sure that there is a more elegant solution….

Using Mechanical Turk for Research

Posted in economics, research on April 5th, 2009 by Michael Ewens – 3 Comments

Amazon’s Mechanical Turk is a services that allows you to hire 100s of people from across the world to tag photos, complete surveys or find websites. According to Wikipedia:

The Amazon Mechanical Turk (MTurk) is one of the suite of Amazon Web Services, a crowdsourcing marketplace that enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do. Requesters, the human beings that write these programs, are able to pose tasks known as HITs (Human Intelligence Tasks), such as choosing the best among several photographs of a storefront, writing product descriptions, or identifying performers on music CDs. Workers (called Providers in Mechanical Turk’s Terms of Service) can then browse among existing tasks and complete them for a monetary payment set by the Requester. To place HITs, the requesting programs use an open Application Programming Interface, or the somewhat limited Mturk Requester site.

Why would economists find this service useful? An example from my own work might help. I am collecting all the individuals employed by new VC firms founded from 1992 – 2007. I need demographic information and employment histories for each VC partner. After scraping the web to get the VC firm websites and “team pages” I have a set of locations for an individual to find the online biography of each of some 3000 VC partners. I submit a job to Mechanical Turk that asks the Turk’er to go to the website, find the individual’s biography and copy and paste the text. Further jobs could ask the Turk to read the bio and answer questions like: 1) Does this person have an MBA or PhD? 2) Male or Female? 3) Founder of firm? Unfortunately, submitting HITs to the Turk system is somewhat difficult. Enter Smartsheet.

Smartsheet as a Frontend to Mechanical Turk

Although the Mechanical Turk service has a system for submitting HITs, it is a little cubersome and requires a bit of strange formatting steps. If you value your time just a little (say, $10/hour) I recommend using the project management webapp Smartsheet’s service SmartSourcing. Again, I refer to my research. I created an Excel file with columns like “Name”, “Firm Name”, “Website” and an empty column “Biography.” I upload this file to Smartsheet, select the rows for which I need biography filled in and walk through their SmartSourcing steps. In about 3 minutes I have submitted a HIT to 1000s of workers that will be complete in 12 hours. I can approve or reject the responses while I watch them populate the online spreadsheet.

Such a service is not free. For a $9.95/month fee (or $99/annual…ask them for a non-profit coupon), you get access to SmartSourcing. Then, on top of the standard Turk fees, you have Smartsheet charges:

Any paid Smartsheet subscriber has access to the Crowdsourcing feature. Monthly charges include Amazon fees plus the cost for work performed (number of tasks completed * cents paid per task) and a low Smartsheet processing fee ($.01 + 10% per task completed – usually $10-$30 per 1,000 tasks).

For example, I paid about $7 for 116 biographies of VC partners. It sounds relatively expensive, but this service has:

  • increased the potential sample size of my studies
  • expanded the set of possible control variables
  • gives you the ability to request multiple workers per task for error checking
  • kept me sane by outsourcing mundane data tasks

Graphing mysql data with PHP + jQuery

Posted in visualization on March 24th, 2009 by Michael Ewens – 5 Comments

For our paper on apartment rental discrimination, I maintain a database of apartment rents across 30 US cities for 1 bedrooms or studios.  Once it is in mysql (easy), I can access it with PHP and send the data to a javascript file to make simple interactive graphs.

Average 1 Bedroom and Studio Rents in Major Cities (x-axis is week number since Feb 15th)

Here is how it was done:

1. PHP script to look through city names (I assume the database exists)

2. SQL statement:

SELECT WEEKOFYEAR( DATE ) AS woy, AVG( rent ) , COUNT( * ) FROM city_counts WHERE city = CITYNAME GROUP BY woy

3. For each city create a simple javascript-friendly string in PHP that looks like this:

{"LA": { label: "LA", data: [[8,1194.9206349206],[9,1261.7650969529]]},

4. Put it all together with Flot (this example) + jQuery.

I know very little (if any) Javascript, but got this up in 15 minutes.

Clean Dates With CRSP Data in Stata

Posted in code on March 21st, 2009 by Michael Ewens – Be the first to comment

Here is some simple code to transform those pesky CRSP dates (YYYYMMDD) to Stata compatible dates:
insheet using "sp_daily.csv"
rename caldt date
* Date Transformation gen year = floor(date/10000)
gen month = floor((date - year*10000)/100)
gen day = floor(date - year*10000 - 100*month)
* replace the date
replace date = mdy(month, day, year)
format date %td