Getting Started With Data Analysis in Python
Pandas is a Python package that provides fast, flexible data structures designed to make working with data easy and intuitive. Do you want to load a CSV file and easily manipulate the data in it? Do you want to replace missing values in your data or ignore them altogether? Do you want a quick statistical summary of your data? Well, pandas has got you covered.
From simple operations like the above to complex data filtering and slicing, pandas provides a set of tools to make working with data simple and efficient. Pandas aims to be the most powerful and flexible open source data analysis / manipulation tool available in any language.
All the code and data set used in this article
Loading data with pandas is quite easy. The library provides functions to load data from Excel files (xls, xlsx), CSV, JSON, pickle, SQL and others. For this example we will be using mock data generated with Mockaroo.
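As a minimal sketch of loading a CSV, the snippet below reads a small in-memory sample instead of the article's actual file. The column names and values are assumptions chosen to resemble a Mockaroo export; in practice you would pass the path of your own file to pd.read_csv().

```python
import io

import pandas as pd

# Hypothetical stand-in for the Mockaroo CSV; names and values are made up.
csv_text = """id,first_name,last_name,gender,ip_address
1,Alice,Smith,Female,10.0.0.1
2,Bob,Jones,Male,10.0.0.2
3,Carol,Lee,Female,10.0.0.3
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(type(df))
print(df)
```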
Loading a file this way (for example with read_csv()) returns a pandas.DataFrame object, a table-like data structure that makes it easier for us to manipulate our data set and extract information. From now on, df will refer to our DataFrame.
Pandas provides some methods to visualize the data we are working on.
df.head() shows the first few rows of our DataFrame; the default is 5 rows.
df.tail() is similar to df.head(), but returns the last few rows of our DataFrame.
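A short sketch of both methods, using a made-up ten-row DataFrame (the data is an assumption, not the article's data set):

```python
import pandas as pd

# Hypothetical sample data with ten rows.
df = pd.DataFrame({
    "id": range(1, 11),
    "first_name": [f"name_{i}" for i in range(1, 11)],
})

first_rows = df.head()     # first 5 rows by default
first_three = df.head(3)   # first 3 rows
last_rows = df.tail()      # last 5 rows by default
last_two = df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```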
Every pandas DataFrame has an index (df.index), an immutable sequence implementing an ordered, sliceable set. It is the basic object storing axis labels for all pandas objects, and it works as the row index for the table.
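To illustrate, here is a small sketch that inspects the index of a made-up DataFrame and shows that assigning to an element of it fails. The sample data is an assumption.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"id": [1, 2, 3], "first_name": ["Alice", "Bob", "Carol"]})

# With no explicit index, pandas labels the rows 0, 1, 2.
print(df.index)

# The index is immutable: item assignment raises a TypeError.
try:
    df.index[0] = 10
except TypeError as err:
    print("Index is immutable:", err)
```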
df.columns holds the column labels used for identifying, filtering and selecting data. They default to the integers 0 to n-1 if no column labels are provided. When using functions like read_csv() or read_excel(), the first line will be used as the column names unless explicitly told otherwise. When reading data from a database, the table's columns will be used as the columns of the DataFrame.
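The sketch below contrasts the two cases with a tiny in-memory CSV (hypothetical data): by default the first line becomes the column labels, while header=None tells read_csv() to treat every line as data and fall back to integer labels.

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line holds the column names.
with_header = pd.read_csv(io.StringIO("id,first_name\n1,Alice\n2,Bob\n"))
print(with_header.columns.tolist())

# header=None: no line is used as a header, so the labels default to integers.
no_header = pd.read_csv(io.StringIO("1,Alice\n2,Bob\n"), header=None)
print(no_header.columns.tolist())
```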
df.describe() shows a quick statistical summary of the numeric columns of your data; in our case, that is only the id column.
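A minimal sketch on made-up data: only the numeric id column appears in the summary, while the text column is skipped.

```python
import pandas as pd

# Hypothetical sample data: one numeric column, one text column.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "first_name": ["Alice", "Bob", "Carol", "Dave"],
})

# By default describe() summarizes numeric columns only.
summary = df.describe()
print(summary)
```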
Selecting a single column, which yields a Series, equivalent to df.first_name. It's possible to use methods like df.head() and df.tail() on a partial DataFrame.
Pandas also supports Python dict-like syntax for accessing columns.
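Both ways of selecting a column can be sketched as follows, on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Alice", "Bob", "Carol"],
})

by_key = df["first_name"]   # dict-like access
by_attr = df.first_name     # attribute access, same result

print(type(by_key))
print(by_key.head(2))       # head() also works on the selected column
```

The dict-like form is the safer of the two when a column name contains spaces or clashes with an existing DataFrame attribute.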
Selecting via [], which slices the rows.
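For instance, a slice inside the brackets selects rows by position, much like slicing a list. The data here is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "first_name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
})

# Rows at positions 0, 1 and 2; the end of the slice is excluded.
first_three = df[0:3]
print(first_three)
```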
Selecting on a multi-axis by label.
The df.loc property takes two arguments. The first selects the rows; since it was left open, i.e. df.loc[:], there isn't any row slicing. The second argument is the list of column names we want to select.
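A call of the kind described above might look like this sketch, where the column names are assumptions about the mock data set:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Alice", "Bob", "Carol"],
    "last_name": ["Smith", "Jones", "Lee"],
})

# ":" keeps every row; the list picks the columns by label.
names = df.loc[:, ["first_name", "last_name"]]
print(names)
```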
We could combine our last 2 examples in one, like this:
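One possible combination of row slicing and column selection, sketched on made-up data. Note that, unlike plain [] slicing, label-based slicing with df.loc includes the end label.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "first_name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "last_name": ["Smith", "Jones", "Lee", "Kim", "Diaz"],
})

# Rows labeled 0 through 2 (inclusive) and two columns by name.
subset = df.loc[0:2, ["first_name", "last_name"]]
print(subset)
```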
For getting fast access to a single value
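One way to do this is the df.at accessor, which takes a row label and a column label. This is a sketch on hypothetical data; the article's original example is not shown here.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Alice", "Bob", "Carol"],
})

# Row label 1, column label "first_name".
value = df.at[1, "first_name"]
print(value)
```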
Pandas supports Boolean indexing, which lets us filter our data based on the values of a column.
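As a sketch, comparing a column with a value produces a Series of True/False values, and passing that Series inside [] keeps only the rows marked True. The gender column and its values are assumptions.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Alice", "Bob", "Carol"],
    "gender": ["Female", "Male", "Female"],
})

mask = df["gender"] == "Female"   # Boolean Series, one entry per row
females = df[mask]                # keeps rows where the mask is True
print(females)
```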
We can also filter on multiple values using the isin() method.
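A minimal sketch with made-up names: isin() called on a column returns a Boolean mask that is True wherever the value is one of the listed options.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "first_name": ["Alice", "Bob", "Carol", "Dave"],
})

# Keep only the rows whose first_name is in the given list.
selected = df[df["first_name"].isin(["Alice", "Dave"])]
print(selected)
```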
Working with missing or incomplete data can be tricky, but pandas makes it easy.
Pandas represents missing values as numpy.nan. We can use this to remove the rows or columns with missing data, or to replace the missing values with another value of our choosing.
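Before removing anything, it can help to see where the gaps are. The sketch below counts missing values per column with isna() and replaces them with fillna(); the data and the "unknown" placeholder are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing ip_address.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "ip_address": ["10.0.0.1", np.nan, "10.0.0.3"],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Replace missing values in ip_address with a placeholder of our choosing.
filled = df.fillna({"ip_address": "unknown"})
print(filled)
```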
We can use df.dropna() to remove incomplete data from our DataFrame.
There are a couple of arguments to look at when using df.dropna(). First there's the argument how, which can receive two values:
df.dropna(how='any') or df.dropna(how='all')
The argument how='any' is the default and will drop any row (or column) with any missing data. The second, how='all', will drop only rows or columns where all values are missing. This can be useful to trim rows or columns from malformed data, like Excel files with headers and footers.
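The difference between the two settings can be sketched on a small made-up DataFrame. The axis argument, which switches between dropping rows (the default) and dropping columns (axis=1), is part of pandas but is not discussed in the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 2 and the notes column are entirely missing.
df = pd.DataFrame({
    "id": [1, 2, np.nan],
    "first_name": ["Alice", np.nan, np.nan],
    "notes": [np.nan, np.nan, np.nan],
})

# how="all": drop rows where every value is missing (only row 2).
without_empty_rows = df.dropna(how="all")

# axis=1 with how="all": drop columns where every value is missing (notes).
without_empty_cols = df.dropna(axis=1, how="all")

# how="any" (the default): drop rows with at least one missing value.
complete_rows = without_empty_cols.dropna(how="any")

print(without_empty_rows)
print(without_empty_cols)
print(complete_rows)
```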
The last thing I want to share on this topic is the argument thresh, which sets the minimum number of non-missing values a row or column must have to be kept; anything with fewer is dropped.
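Here is a sketch of thresh applied to columns, on hypothetical data in which ip_address has one gap and gender has three:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Female", np.nan, np.nan, np.nan],
    "ip_address": ["10.0.0.1", "10.0.0.2", np.nan, "10.0.0.4"],
})

# Without thresh, every column containing a missing value is dropped.
strict = df.dropna(axis=1)

# thresh=3: keep columns that have at least 3 non-missing values.
lenient = df.dropna(axis=1, thresh=3)

print(strict.columns.tolist())
print(lenient.columns.tolist())
```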
Comparing this last example with the one where we removed the columns, we can see that the column ip_address was preserved, since it had enough non-missing values to meet the threshold.
Final Considerations
Pandas is a really powerful and fun library for data manipulation and analysis, with easy syntax and fast operations. This article is just the tip of the iceberg; it is possible to do much more by exploring the rest of the tools that pandas provides, and I encourage you to try it and share your experiences.
Final Words
Hope this helps!
Happy coding!