Exploring Facebook data at the command-line (Part II)

In this project, we’re going to mine a data set generated by running a Facebook scraper on a particular (undisclosed) Facebook page.

The goal of this experiment is to find the most vibrant status message on that page, with just one Bash command. Download the data from the link below and save it as facebookdata.csv:
wget https://www.scientificprogramming.io/datasets/facebookdata.csv
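Before diving in, it helps to know which column holds which field. csvkit’s csvcut has a -n option that prints each column’s index and name:
$ csvcut -n facebookdata.csv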
In the previous lesson (Part I), we learned how the csvcut command can help us cut a given set of columns (e.g., 1,4,7-11). Note that we did not preview columns 2 and 3; those are wider text columns and would not fit properly into our preview screen.
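As a refresher, that kind of preview can be reproduced by piping csvcut into csvlook (another csvkit tool); the column selection below is just an illustration:
$ csvcut -c 1,4,7-11 facebookdata.csv | csvlook | head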

Let us now count how many entries there are for each status type in the dataset. Status types live in column 4. To extract column 4 from our file, we can make use of csvcut again, as follows:
$ csvcut -c 4 facebookdata.csv | sort | uniq -c
Here, the command-line option -c specifies which column to extract (or cut out). Note that csvcut, like cut, does not modify the original file it acts on. Next, we would like to count how many entries belong to each status type. For this we need uniq -c, which counts (hence the -c) how many times each status type appears. However, uniq only collapses adjacent duplicate lines, so it requires sorted input; that is why the first step was to sort the list of status types.
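Note that this pipeline also counts the CSV header row as one of the “status types”. If you want to exclude it, one option (assuming the file has a single header row) is to drop the first line with tail before sorting:
$ csvcut -c 4 facebookdata.csv | tail -n +2 | sort | uniq -c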

Find the most popular status entry!
To do this analysis efficiently, we’ll use the command-line language called awk, a tool that allows you to filter, extract, and transform data files. awk is a very useful tool to keep in your bag of tricks. To start, let’s look at a very simple awk program that outputs every line of our facebookdata.csv file, where we specify the delimiter of the file (a comma) using the -F option.
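That program might look like the following (a minimal sketch; with no pattern given, the action runs on every line, and head is added only to keep the output short):
$ awk -F "," '{ print $0 }' facebookdata.csv | head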
Since the data set has quoted ("text") cells, which a plain comma split would mishandle, we will first use the CSV-aware csvcut to extract the required columns: we want columns 1,8-15 in a file called fbreactions.csv.
csvcut -c 1,8-15 facebookdata.csv > fbreactions.csv
The idea is to sum up all the reactions (like, share, love, etc., columns 8 through 15) on each FB status and then find the status with the maximum number of reactions. To calculate the total number of reactions on each entry (status), all we need to do is add up the numbers across columns 8-15 horizontally, which we can do easily with awk, as follows:
$ awk -F "," '{ total = total + $2 + $3 + $4 + \
$5 + $6 + $7 + $8 + $9; print $1 "," \
total; total = 0 }' fbreactions.csv | head
Let’s pay attention to the awk statement: it not only sums up the columns side by side, but on each line it also prints two outputs (the status id and the total number of reactions on that row). Finally, at the end of each iteration, it resets the accumulator with total=0.
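That reset is needed because awk variables keep their values across input lines. An equivalent sketch avoids the running variable altogether by computing the sum as a single expression per line:
$ awk -F "," '{ print $1 "," ($2+$3+$4+$5+$6+$7+$8+$9) }' fbreactions.csv | head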
Find the status with the MAX reactions

To get the status with the maximum reactions, we next sort the status ids by the number of reactions (column 2) using sort -n -r -t"," -k 2, which tells the system to sort the piped (|) output numerically (-n), in reverse (descending) order (-r), on column 2 (-k 2), treating the comma as the field delimiter (-t","):
$ awk -F "," '{ total = total \
+ $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9; \
print $1 "," total; total = 0 }' \
fbreactions.csv | sort -n -r -t"," -k 2 \
| head -n 1
The final output tells us that the status with id 7331091005.. had the maximum number of reactions, 668121 in total. If we now use grep, we can easily find the message that received the largest number of reactions.
grep 7331091005_10154089857531006 facebookdata.csv
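If you only want the message text rather than the entire matching row, csvkit’s csvgrep can help; the sketch below assumes the status id is in column 1 and the message in column 2 (one of the wide text columns mentioned earlier):
$ csvgrep -c 1 -m 7331091005_10154089857531006 facebookdata.csv | csvcut -c 2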

This way we find the most popular message on that page, which was:
LeBron and the Cavs are tired of being bullied
How interesting is that?
[This project is a part of the ‘Learn to Analyze Data in Bash Shell and Linux’ course.]