Easy Code for Beginners in Data Analysis

Python is a high-level, versatile language for data analysis, and packages such as pandas make everyday work much easier. Here is some code every beginner data analyst should know to make their early wrangling effective. It forms part of the programmatic assessment of data and sees where your eyes cannot.

Let's go!

The most important first step is to get a look at your data so that you can work on it properly. The point is to gain some insight into the data before you start changing it. Some easy code to help do that is here:

.head()

This code shows you the first five rows of your DataFrame by default. You can see your index, column headings and a few values at a glance, and common patterns in the data sometimes show up here. Passing a number in the brackets shows that many rows from the top. Keep in mind that how much you can see at once depends on your environment's display settings; a Jupyter notebook, for instance, may truncate a long output even if the code reads .head(57)

DataFrame.head()

DataFrame.head(15)
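As a quick sketch, using a small made-up DataFrame (the column names and values here are purely illustrative):

```python
import pandas as pd

# A small illustrative DataFrame
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Pune', 'Kyiv', 'Cork', 'Bonn'],
                   'temp_c': [4, 19, 31, 7, 11, 9]})

first_five = df.head()   # first 5 rows by default
first_two = df.head(2)   # first 2 rows
```

Both calls return a new DataFrame, so you can inspect or reuse the result without touching the original.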

.tail()

This shows the last five rows of your DataFrame by default. If the data is in chronological order, this reveals the most recent entries. Passing a number in the brackets works the same way as with .head() and displays that many rows

DataFrame.tail()

DataFrame.tail(15)
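A minimal sketch with invented daily data, showing that .tail() mirrors .head() from the other end:

```python
import pandas as pd

# Illustrative chronological data
df = pd.DataFrame({'day': range(1, 8),
                   'sales': [5, 8, 6, 9, 12, 11, 14]})

last_five = df.tail()     # last 5 rows by default
last_three = df.tail(3)   # last 3 rows, i.e. the most recent entries here
```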

.info()

This is one of the most comprehensive summaries: it lists every column, its non-null count, its data type, and the size and memory usage of the DataFrame

DataFrame.info()
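Normally .info() prints straight to the screen; as a sketch, the optional buf parameter lets us capture that summary as a string instead (the DataFrame below is made up):

```python
import io
import pandas as pd

# Illustrative data with a missing value, so the non-null counts differ
df = pd.DataFrame({'name': ['Ada', 'Bo', None],
                   'score': [91.0, 85.5, 77.0]})

buf = io.StringIO()
df.info(buf=buf)          # write the summary into the buffer instead of stdout
summary = buf.getvalue()  # column names, non-null counts and dtypes as text
```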

.sample()

This is arguably one of the best ways of visually assessing your data programmatically. It returns random rows from the DataFrame: one by default, or as many as you pass as a parameter, again subject to your display's buffer/window size

DataFrame.sample()

DataFrame.sample(13)
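A small sketch on made-up data; the random_state argument is worth knowing because it makes a "random" sample reproducible:

```python
import pandas as pd

df = pd.DataFrame({'id': range(10), 'value': range(10, 20)})

one_row = df.sample()                       # one random row by default
three_rows = df.sample(3)                   # three random rows
repeatable = df.sample(3, random_state=0)   # fixed seed: same rows every run
```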

.describe()

This is one of the cleanest elementary summaries one can give their data. It creates a table of insightful statistics for each numeric column: count, mean, standard deviation, minimum and maximum values, and percentiles

DataFrame.describe()
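A sketch on invented data, which also shows that non-numeric columns are left out of the summary by default:

```python
import pandas as pd

df = pd.DataFrame({'height_cm': [150, 160, 170, 180],
                   'label': ['a', 'b', 'c', 'd']})

stats = df.describe()   # count, mean, std, min, percentiles, max
```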

After looking at your data, you will need to clean it up, so that any problems you find come from the quality of the data itself and not from messiness. Some code that does that is here:

.copy()

Careful analysts use this to create a copy of their DataFrame, then use the copy as their sandbox. It prevents alterations to the original data, so there's something to go back to without restarting an entire scraping process or kernel

sandbox_dataframe = DataFrame.copy()
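A minimal sketch of the sandbox idea, on made-up data. Note that .copy() is a deep copy by default, so edits to the sandbox never leak back:

```python
import pandas as pd

original = pd.DataFrame({'a': [1, 2, 3]})
sandbox = original.copy()           # deep copy by default

sandbox['a'] = sandbox['a'] * 10    # experiment freely in the sandbox
```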

.duplicated().sum()

While cleaning up data, it's a good idea to deal with duplicates so that downstream results stay objective. The first part of the code, .duplicated(), returns True or False for every row, marking repeats of earlier rows; the second part, .sum(), counts those True values, giving the number of duplicates

DataFrame.duplicated()

DataFrame.duplicated().sum()
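A sketch on invented data: the third row repeats the first exactly, so it is the only one flagged.

```python
import pandas as pd

df = pd.DataFrame({'user': ['amy', 'ben', 'amy', 'ben'],
                   'plan': ['pro', 'free', 'pro', 'pro']})

flags = df.duplicated()          # True for each repeat of an earlier row
n_dupes = df.duplicated().sum()  # total number of duplicate rows
```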

.T

A DataFrame closely resembles a matrix and, as such, transposing (swapping rows and columns) is an operation that can be easily carried out on it. Transposing helps with plotting and with understanding wide data

DataFrame.T
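A small sketch with made-up regional data, showing rows and columns swapping places:

```python
import pandas as pd

df = pd.DataFrame({'q1': [10, 20], 'q2': [30, 40]},
                  index=['north', 'south'])

flipped = df.T   # rows become columns and vice versa
```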

.reset_index()

A handy method, especially after dropping certain columns and rows, this renumbers the index to reflect the rows now present; the old index is moved into a new column unless you pass drop=True. It takes several optional arguments, e.g. drop, inplace, level, col_level, etc. (To make another column the index, use set_index instead)

DataFrame.reset_index( )
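A sketch on made-up data, showing both behaviours: keeping the old index as a column, and dropping it for a clean 0..n-1 renumbering.

```python
import pandas as pd

df = pd.DataFrame({'score': [3, 1, 2]}, index=[10, 11, 12])
trimmed = df[df['score'] > 1]   # dropping a row leaves gaps in the index

renumbered = trimmed.reset_index(drop=True)  # fresh 0..n-1 index
kept = trimmed.reset_index()                 # old index moved into a column
```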

Once data has been tidied in your first project, you can try these out:

.plot() / .hist()

This is one of the best ways of observing your data. You do not need to have all your data tidied when using it, as it gives you an overview at a glance, so it is useful both during cleaning and after.

.plot(figsize=(x, y)) / .hist(figsize=(x, y))
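A sketch on invented readings, assuming matplotlib is installed (pandas plotting depends on it). The Agg backend is only there so the code runs without a display; in a notebook you would skip those two lines:

```python
import matplotlib
matplotlib.use('Agg')   # headless backend; not needed in a notebook
import pandas as pd

df = pd.DataFrame({'reading': [1, 3, 2, 5, 4, 6]})

ax_line = df.plot(figsize=(6, 4))             # line plot, returns an Axes
ax_hist = df['reading'].hist(figsize=(6, 4))  # histogram of one column
```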

.unstack()

This is an interesting bit of pandas for when you are creating multivariate plots and want the groups to sit next to each other according to your legend.

activity = df.groupby('col_a')['col_b'].value_counts()

activity.unstack()
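A full sketch on made-up data (col_a / col_b are just placeholder names): value_counts() on a groupby gives a Series with a two-level index, and .unstack() pivots the inner level into columns, which is the shape side-by-side plots want.

```python
import pandas as pd

df = pd.DataFrame({'col_a': ['x', 'x', 'y', 'y', 'y'],
                   'col_b': ['hi', 'lo', 'hi', 'hi', 'lo']})

counts = df.groupby('col_a')['col_b'].value_counts()  # MultiIndex Series
table = counts.unstack()   # inner index level ('hi'/'lo') becomes columns
```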

.to_csv(index=False)

The classic save of your table as a CSV never goes wrong when you state whether you want pandas to write the index as an extra column or not. Most of the time, you do not want that. The quick way to go about it is to set index to False and keep things as you had them.

DataFrame.to_csv('file_name.csv', index=False)
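A sketch using an in-memory buffer as a stand-in for a file on disk, so you can see exactly what index=False writes (in practice you would pass a filename instead of the buffer):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

buf = io.StringIO()            # stand-in for a file on disk
df.to_csv(buf, index=False)    # no extra index column in the output
csv_text = buf.getvalue()
```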

As you know, the above is not the only code used in data analysis, but it is some of the code that will make your wrangling attempts a lot easier. It is also imperative to note that sometimes your visual assessment can pick up things a programmatic assessment will not.

To understand in more detail how the above code works, you can visit the official pandas docs to get the full keywords, arguments and understanding you will be needing.

For more interesting beginner code, you can have a look at my GitHub repo and reach out.