Ben Lorica

Ben Lorica is the Senior Analyst in the Market Research Group at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, and Financial Engineering. At O'Reilly, Ben works in the open source data warehouse and analytics area.

An ex-academic, he was an Assistant Professor at U.C. Davis and was the founding Department Chair for Statistics and Mathematics at C.S.U. Monterey Bay.

Twitter and the Micro-Messaging Revolution: Communication, Connections, and Immediacy--140 Characters at a Time
by Abdur Chowdhury , Gregor Hochmuth , Ben Lorica , Roger Magoulas , Sarah Milstein , Tim O'Reilly
June 2009
Ebook: $99.00

Where 2.0: The State of the Geospatial Web
by Brady Forrest , Ben Lorica , Roger Magoulas , Andrew Turner
June 2009
Ebook: $399.00

Virtual Worlds: A Business Guide
by Ben Lorica , Roger Magoulas
June 2009
Ebook: $249.00

Recent Posts | All O'Reilly Posts

HBase looks more appealing to data scientists

June 16 2013

When Hadoop users need to develop apps that are “latency sensitive”, many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, … read more

It’s getting easier to build Big Data applications

June 09 2013

Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera … read more

Tracking the progress of large-scale Query Engines

June 04 2013

As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing … read more

How signals, geometry, and topology are influencing data science

May 24 2013

I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology, differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection perhaps it shouldn’t be so … read more

Improving options for unlocking your graph data

May 19 2013

The popular open source project GraphLab received a major boost early this week when a new company, comprised of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open … read more

11 Essential Features that Visual Analysis Tools Should Have

May 11 2013

After recently playing with SAS Visual Analytics, I’ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first … read more

Scalable streaming analytics using a single-server

May 05 2013

For many organizations real-time analytics entails complex event processing systems (CEP) or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to scale (ingest) massive amounts of data, … read more

Tachyon: An open source, distributed, fault-tolerant, in-memory file system

April 28 2013

In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in-memory. But suppose one wants to share datasets across jobs/frameworks, while retaining speed gains garnered by being in-memory? An … read more

Simpler workflow tools enable the rapid deployment of models

April 21 2013

Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate most of the data you need is already in usable form.) Data needs to be assembled and wrangled, before it can be visualized … read more

Single server systems can tackle big data

April 13 2013

About a year ago a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, … read more

The re-emergence of time-series

April 09 2013

My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and … read more

The re-emergence of Time-series

April 05 2013

My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, … read more

Data Science tools: Are you “all in” or do you “mix and match”?

March 31 2013

An integrated data stack boosts productivity As I noted in my previous post, Python programmers willing to go “all in”, have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs … read more

Python data tools just keep getting better

March 24 2013

Here are a few observations inspired by conversations I had during the just concluded PyData conference. The Python data community is well-organized: Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data … read more

Data Science Tools: Fast, easy to use, and scalable

March 03 2013

Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference. Spark is attracting attention I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions … read more

MLbase: Scalable Machine-learning made accessible

February 22 2013

In the course of applying machine-learning against large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring … read more

An update on in-memory data management

February 21 2013

By Ben Lorica and Roger Magoulas We wanted to give you a brief update on what we’ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have … read more

Need speed for big data? Think in-memory data management

January 18 2013

By Ben Lorica and Roger Magoulas In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries … read more

GraphChi: Graph analytics over billions of edges using your laptop

December 12 2012

GraphChi is a spinoff project of GraphLab, an open source, distributed, in-memory software system for analytics and machine-learning. Designed specifically to run on a single computer with limited memory (DRAM), since its release a few months ago GraphChi has been … read more

Shark: Real-time queries and analytics for big data

November 27 2012

Hadoop’s strength is in batch processing; MapReduce isn’t particularly suited for interactive/adhoc queries. Real-time SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. … read more

Spark 0.6 improves performance and accessibility

October 16 2012

In an earlier post I listed a few reasons why I’ve come to embrace and use Spark. In particular I described why Spark is well-suited for many distributed Big Data Analytics tasks such as iterative computations and interactive queries, where … read more

Seven reasons why I like Spark

August 21 2012

A large portion of this week’s Amp Camp at UC Berkeley is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a … read more

Active Facebook users by region: November, 2010

November 16 2010

With Facebook unveiling an integrated messaging system for its more than 500 million users, I decided to update a few charts that break down its users by region. read more

Hiring trends among the major platform players

November 15 2010

After recently re-reading Tim's post on the major internet platform players, I looked at recent hiring trends among the companies he highlighted. First I examined year-over-year changes in number of job postings (from Aug to Oct 2009 vs. Aug to Oct 2010). Consistent with the recent flurry of articles about… read more

Windows Phone apps are more expensive than iPhone apps

November 05 2010

The Windows Marketplace for Mobile now has about 1,400 apps spread across 16 categories. In this short post I'll provide some basic statistics and compare it with the granddaddy of app stores: the U.S. iTunes store. read more

Crowdsourcing Specific Microtasks

October 25 2010

Since the first-ever Mechanical Turk meetup a year ago, there has been an explosion in crowdsourcing services and a well-attended conference in San Francisco. I remain enthusiastic about crowdsourcing, but the number of companies has me worried about quality of work. Fortunately specialization is already occurring, so for particular tasks… read more

Amazon's cloud platform still the largest, but others are closing the gap

August 31 2010

Tim's recent tweet on the growing demand for Google App Engine skills inspired me to measure the popularity of the major cloud computing platforms. Elance is one of many job boards in our data warehouse of U.S. job postings, and I wanted to measure demand across many more job… read more

The number of Hadoop jobs continue to rise

August 08 2010

While still a small fraction of data management job postings, the number of job posts that mention "hadoop" continues to grow steadily. Year-over-year, there were 300% more such job posts in the first seven months of 2010 compared to the same period in 2009: The fraction of "hadoop" jobs posted… read more

Which Social Gaming companies are Hiring

July 29 2010

Disney's announced purchase of Mountain View gaming startup Playdom, follows on the heels of EA's purchase of London-based Playfish last November. Based on active users Zynga remains by far the biggest online social gaming company, but what other independent companies are growing? To see which companies are expanding, I used… read more

Where Facebook's half a billion users reside

July 21 2010

Facebook announced that they now reach 500 million active users (just five and a half years after launching). But where do these half a billion users reside? Refreshing my post from February, the share of users from Asia continues to rise and now stands at 17% of all Facebook users. Over… read more

Popular iPhone games stay highly-ranked only for a few weeks

June 30 2010

With 40,000+ Games to choose from, the list of Top 100 free and paid games are frequently scanned by iPhone gamers. In this short post, I'll share some basic statistics on popular games sold through the U.S. iTunes app store. read more

Actually, half of all iPad Books are Fiction

May 05 2010

Suggestions to my previous post inspired me to normalize our metadata for titles available through the U.S. iBooks app. A comment prompted me to rollup iBooks publishers into publishing conglomerates: Comments from other readers gave me the idea to map the 100+ iBooks categories to the more familiar BISAC categories.… read more

A few weeks in, a third of iPad Books are Fiction

April 29 2010

Measured in terms of number of titles, half of the over 46,000 (paid and free) books available through the iBooks app are from 6 categories. Fiction & Literature alone account for close to a third of all available iBooks titles: The current set of titles is indicative of the publishers… read more

Big Data shakes up the Speech Industry

April 23 2010

I spent a few hours at the Mobile Voice conference and left with an appreciation of Google's impact on the speech industry. Google's speech offerings loomed over the few sessions I attended. Some of that was probably due to Michael Cohen's keynote describing Google's philosophy and approach, but clearly Google… read more

Cookbooks: The highest priced iPad book category

April 21 2010

Just like the iTunes app store, the iBooks app on the iPad spotlights the Top Paid (and Top Free) books within each category. Here are some charts that compare the average price (by rank) across the major categories. The average price of the Top 50 titles across the major categories… read more

Big Data Analytics: From Data Scientists to Business Analysts

April 19 2010

The growing popularity of Big Data management tools (Hadoop; MPP, real-time SQL, NoSQL databases; and others) means many more companies can handle large amounts of data. But how do companies analyze and mine their vast amounts of data? The cutting-edge (social) web companies employ teams of data scientists who comb… read more

Twitter By The Numbers

April 14 2010

I collected some interesting stats from today's presentations at Chirp. Over a thousand people attended the conference and the numbers below attest to how vibrant the Twitter platform is. Today's announced API enhancements will make the Twitter ecosystem even more interesting: 1. # of registered users: 105,779,710 (1,500% growth over… read more

Games & Entertaiment account for Half of all iPad apps

April 09 2010

98% of apps in the U.S. iTunes app store label themselves as "iPad compatible", but most were written for iPhones or iPods. One week into its launch there are about 2,300 apps that run only on iPads. Measured in terms of number of unique apps, Games and Entertainment account for… read more

Google's New Marketplace Has over a Thousand Apps

March 17 2010

One week into its public launch, the Google Apps Marketplace has just under 1,500 (enterprise) apps. Combined with Salesforce.com's app exchange (also with over a thousand apps), enterprises interested in moving to cloud apps have an increasing number of software tools to choose from. Popular apps (measured in terms of… read more

Twitter Users Most Followed by the Web 2.0 Summit Crowd - O'Reilly ...

October 28 2009

I took the set of users who posted tweets containing the hashtag #w2s and determined who those users followed. Unlike the list of the most followed users in all of Twitter, the list isn't dominated by celebrities... read more
