Exploring Facebook data at the command-line (Part II)

In this project, we’re going to mine a data set generated by running a Facebook scraper on a particular (undisclosed) Facebook page.

The goal of this experiment is to find the most vibrant status message on that page, with just one Bash command. Download the data from the link below and save it as facebookdata.csv:
wget https://www.scientificprogramming.io/datasets/facebookdata.csv
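Before diving in, it helps to know which column holds which field. csvkit’s csvcut has a -n option that prints each column’s index and name:
$ csvcut -n facebookdata.csv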
In the previous lesson (Part I), we learned how the csvcut command can help us cut a given set of columns (e.g., 1,4,7-11). Note that we did not preview columns 2 and 3; those are wider text columns and would not fit properly into our preview screen.
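As a refresher, that kind of preview can be reproduced by piping csvcut into csvlook (another csvkit tool); the column selection below is just an illustration:
$ csvcut -c 1,4,7-11 facebookdata.csv | csvlook | head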

Let us now count how many entries there are for each status type in the dataset. Status types live in column 4. To extract column 4 from our file, we can make use of csvcut again, as follows:
$ csvcut -c 4 facebookdata.csv | sort | uniq -c
Here, the command-line option -c specifies which column to extract (or cut out). Note that csvcut, like cut, does not modify the original file it acts on. Next, we would like to count how many entries belong to each status type. For this we need uniq -c, which counts (hence the -c) how many times each status type appears. However, uniq only collapses adjacent duplicate lines, so it requires sorted input; that is why the first step was to sort the list of status types.
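Note that this pipeline also counts the CSV header row as one of the “status types”. If you want to exclude it, one option (assuming the file has a single header row) is to drop the first line with tail before sorting:
$ csvcut -c 4 facebookdata.csv | tail -n +2 | sort | uniq -c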

Find the most popular status entry!
To do this analysis efficiently, we’ll use the command-line language called awk, a tool that allows you to filter, extract, and transform data files. awk is a very useful tool to keep in your bag of tricks. To start, let’s look at a very simple awk program that outputs every line of our facebookdata.csv file, where we specify the delimiter of the file (a comma) using the -F option.
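That program might look like the following (a minimal sketch; with no pattern given, the action runs on every line, and head is added only to keep the output short):
$ awk -F "," '{ print $0 }' facebookdata.csv | head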
Since the data set has quoted ("text") cells, which a plain comma split would mishandle, we will first use the CSV-aware csvcut to extract the required columns: we want columns 1,8-15 in a file called fbreactions.csv.
csvcut -c 1,8-15 facebookdata.csv > fbreactions.csv
The idea is to sum up all the reactions (like, share, love, etc., columns 8 through 15) on each FB status and then find the status with the maximum number of reactions. To calculate the total number of reactions on each entry (status), all we need to do is add up the numbers across columns 8-15 horizontally, which we can do easily with awk, as follows:
$ awk -F "," '{ total = total + $2 + $3 + $4 + \
$5 + $6 + $7 + $8 + $9; print $1 "," \
total; total = 0 }' fbreactions.csv | head
Let’s pay attention to the awk statement: it not only sums up the columns side by side, but on each line it also prints two outputs (the status id and the total number of reactions on that row). Finally, at the end of each iteration, it resets the accumulator with total=0.
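That reset is needed because awk variables keep their values across input lines. An equivalent sketch avoids the running variable altogether by computing the sum as a single expression per line:
$ awk -F "," '{ print $1 "," ($2+$3+$4+$5+$6+$7+$8+$9) }' fbreactions.csv | head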
Find the status with the MAX reactions

To get the status with the maximum reactions, we next sort the status ids by the number of reactions (column 2) using sort -n -r -t"," -k 2, which tells the system to sort the piped (|) output numerically (-n), in reverse (descending) order (-r), on column 2 (-k 2), treating the comma as the field delimiter (-t","):
$ awk -F "," '{ total = total \
+ $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9; \
print $1 "," total; total = 0 }' \
fbreactions.csv | sort -n -r -t"," -k 2 \
| head -n 1
The final output tells us that the status with id 7331091005.. had the maximum number of reactions, 668121 in total. If we now use grep, we can easily find the message that received the largest number of reactions.
grep 7331091005_10154089857531006 facebookdata.csv
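If you only want the message text rather than the entire matching row, csvkit’s csvgrep can help; the sketch below assumes the status id is in column 1 and the message in column 2 (one of the wide text columns mentioned earlier):
$ csvgrep -c 1 -m 7331091005_10154089857531006 facebookdata.csv | csvcut -c 2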

This way we find the most popular message on that page, which was:
LeBron and the Cavs are tired of being bullied
How interesting is that?
[This project is a part of the ‘Learn to Analyze Data in Bash Shell and Linux’ course.]