Archives for July 6, 2015

REThink 2015 Day 5, Big Data


What is Big Data? From the PowerPoint below by Jeffrey Popyack and William Mongan.

We spent this morning talking about Big Data. Big Data is an integral part of computing, and it will revolutionize how we live. Never before have we stored so much information, and we’re only now figuring out ways to process it. It’s super interesting, because it will completely change the way that we do science. Here are some notes I took (I’ll post the .ppt below):

1 byte is needed to store a single character (a letter, digit, or symbol).

“Big Data” = 8 bytes (ASCII), counting the space.

 The text of Dr. Seuss’ “Green Eggs & Ham” is 3.3 kilobytes in size.
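
The byte count for “Big Data” can be checked with a few lines of JavaScript. This is only a small sketch: the helper name byteLength is made up for illustration, and it relies on the standard TextEncoder, which encodes as UTF-8 (identical to ASCII for plain English text).

```javascript
// Hypothetical helper: number of bytes needed to store a string.
// TextEncoder always produces UTF-8, which matches ASCII for plain English text.
function byteLength(text) {
  return new TextEncoder().encode(text).length;
}

// "Big Data" has 8 characters (7 letters plus the space), so it needs 8 bytes.
const bigDataBytes = byteLength("Big Data");
console.log(bigDataBytes); // 8
```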


Hadoop Distributed Filesystem (HDFS)

An equal amount of work is not necessarily equal

Latency due to processor speed and distances

Parallel Computing


Big Data PowerPoint - William Mongan and Jeff Popyack

We then used JSMapReduce to count how many states contain each city/town name, so that we could find the most common one. Here’s the site (well, a web archive) of JSMapReduce. The code we ended up using is:

function Mapper(jsmr_context, data) {
    // separate a line of data into separate entries …
    var words_list = data.split('\t');

    // extract the city and state name and output as a pair
    jsmr_context.Emit(words_list[2], words_list[5]);
}

function Reducer(jsmr_context, city) {
    // tracks which states have already been seen for this city
    var states_map = {};
    // (key,value-list) is (city,[state1,state2,state3,…])
    var number_of_states = 0;
    while (jsmr_context.HaveMoreValues()) {
        var state = jsmr_context.GetNextValue();
        // count each state only the first time it appears
        if (!(state in states_map)) {
            states_map[state] = 1;
            number_of_states++;
        }
    }
    jsmr_context.Emit(city + ':' + number_of_states);
}
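
To try the Mapper and Reducer without the JSMapReduce site, here is a rough sketch that runs them locally. The runLocally function and its mock context objects are my own stand-ins, not part of JSMapReduce; they only imitate the Emit, HaveMoreValues, and GetNextValue calls used above. The sample rows are made up, and I’m assuming a tab-separated layout with the city in column 2 and the state in column 5.

```javascript
// Mapper and Reducer from the post, with braces restored.
function Mapper(jsmr_context, data) {
  var words_list = data.split('\t');
  jsmr_context.Emit(words_list[2], words_list[5]);
}

function Reducer(jsmr_context, city) {
  var states_map = {};
  var number_of_states = 0;
  while (jsmr_context.HaveMoreValues()) {
    var state = jsmr_context.GetNextValue();
    if (!(state in states_map)) {
      states_map[state] = 1;
      number_of_states++;
    }
  }
  jsmr_context.Emit(city + ':' + number_of_states);
}

// Hypothetical stand-in for the JSMapReduce runtime: map, group by key, reduce.
function runLocally(lines) {
  var groups = new Map(); // city -> list of states (the "shuffle" step)
  lines.forEach(function (line) {
    Mapper({
      Emit: function (key, value) {
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(value);
      }
    }, line);
  });

  var output = [];
  groups.forEach(function (values, key) {
    var next = 0; // index of the next value handed to the reducer
    Reducer({
      HaveMoreValues: function () { return next < values.length; },
      GetNextValue: function () { return values[next++]; },
      Emit: function (result) { output.push(result); }
    }, key);
  });
  return output;
}

// Made-up rows: only column 2 (city) and column 5 (state) matter here.
var sampleLines = [
  "1\tx\tSpringfield\tx\tx\tIL",
  "2\tx\tSpringfield\tx\tx\tMA",
  "3\tx\tSpringfield\tx\tx\tIL",
  "4\tx\tFranklin\tx\tx\tTN"
];

var results = runLocally(sampleLines);
console.log(results); // [ 'Springfield:2', 'Franklin:1' ]
```

Springfield shows up three times but in only two different states, so the reducer reports 2. That is the whole point of states_map: repeated states are not counted twice.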

After lunch (at the Shake Shack), I spent some time reading about MATLAB on the internet. Here are some of the sites I used to learn the basics of MATLAB: