There is a common beginner question for engineers starting out with Big Data. An engineer will post to social media saying “I need to know which Big Data technology to use. I have 3 billion rows in 10,000 files. The whole dataset is 100 GB. Is Big Data Technology X efficient for processing this?”
The short answer is no. The long answer is "more than likely no," and only a qualified data engineer can tell you for sure.
The issue starts with a misunderstanding of what Big Data is and isn’t. The original poster is assuming that small data technologies can’t do something for them. After all, 3 billion rows sounds like a lot. It isn’t.
If you think about it, you can easily provision a VM with 256 GB of RAM. For a dataset of 100 GB, the entire dataset could fit in memory. There are some nuances like how much this dataset will grow and the complexity of the processing, but this probably isn’t a Big Data problem.
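The memory math above can be sketched in a few lines of Python. Note that the 2× overhead factor is an assumption for illustration (in-memory representations often inflate on-disk size), not a universal rule; real headroom depends on the file format and the processing engine.

```python
def fits_in_memory(dataset_gb: float, ram_gb: float, overhead: float = 2.0) -> bool:
    """Back-of-envelope check: budget a multiplier for in-memory
    inflation (the 2x default here is an assumed rule of thumb)."""
    return dataset_gb * overhead <= ram_gb

# The 100 GB dataset from the question, on a 256 GB VM:
print(fits_in_memory(100, 256))  # True: likely a single-machine job
```

If this check passes with room to spare for projected growth, a single beefy VM and a "small" data tool will usually beat a distributed cluster on cost and simplicity.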
In the reply threads to these questions, there is often another person who responds that the use case doesn't need Big Data. Sometimes the original poster gets insulted or thinks that people are belittling their use case. They aren't.
This is because their use case would be far better served by a "small" data technology, like a cloud data warehouse, as the data store. Using a technology with a relational structure instead of a Big Data technology like the Hadoop ecosystem has these major benefits:
- Less conceptual complexity
- More prevalent in the marketplace
- More people know the technology
- Easier operationally
- Faster query speeds
- Cheaper operationally, technically, and people-wise
- Shorter development cycles
When someone is telling you that your use case is small data, they aren’t belittling you or your use case. They’re saving you time, money, and effort.
If you do have a Big Data problem, you are being held back by a specific limitation of a small data technology. You are saying "can't" because you are hitting a known technical limit. For example:
- You’re a manager, and when you ask for a new feature or a report, the technical person says it can’t be done due to a technical limitation.
- You’re a developer and you can’t add new features because the database or data warehouse will fall over and die.
- You’re an analyst and you can’t do your report because it would take too long or process too much data.
- You’re a Data Warehouse Engineer, and even you can’t run the most intensive queries because they take too much time and too many resources.
These problems often accompany a scale of 100s of billions of rows or petabytes of data. For these problems, you will need highly-trained data engineers.
I’ve seen companies succeed with Big Data in the following ways:
- Allowing enough time to have a sane project plan
- Having realistic expectations for what Big Data would do for the company
- Spending the money on excellent training
- Getting the team the mentoring and help they need
- Realizing Big Data is a complex animal
And I’ve seen companies fail in the following ways:
- Thinking Big Data is the silver bullet that will save the company from itself
- Rushing through the process and not giving the team the time and resources to succeed
- Thinking the team can just read some books or watch some YouTube videos to learn Big Data
- Cheaping out on training and help for the team
- Having a team without the right skills
Remember that even if your organization does have Big Data use cases, not every data-related use case within it is a Big Data one. Small data and Big Data use cases can coexist in the same organization, and the two should be approached differently. Don’t hit a fly with a sledgehammer – using Big Data technologies for small data brings high expense with little reward.
If you’re running a business that needs help with your Big Data strategy, you can read about my mentoring service.