Raw cells are not evaluated by the notebook. When passed through nbconvert, raw cells arrive in the destination format unmodified. For example, this allows you to type full LaTeX into a raw cell, which will only be rendered by LaTeX after conversion by nbconvert. Additional Documentation: For Markdown Cells, as quoted from Jupyter Notebook docs.
Pandas is one of the most popular Python libraries for Data Science and Analytics. I like to say it’s the “SQL of Python.” Why?
Because pandas helps you to manage two-dimensional data tables in Python. Of course, it has many more features. In this pandas tutorial series, I’ll show you the most important (that is, the most often used) things that you have to know as an Analyst or a Data Scientist.
This is the first episode and we will start from the basics!

Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!

Before we start

If you haven’t done so yet, I recommend going through the preparatory articles first. To follow this pandas tutorial, you will need a fully functioning data server with Python 3, numpy and pandas on it.

Note: Again, those articles show how to set up your data server and Python 3, and how to set up numpy and pandas, too.

Next step: log in to your server and fire up Jupyter. Then open a new Jupyter Notebook in your favorite browser.
(If you don’t know how to do that, I really do recommend going through the articles I recommended in the “Before we start” section.)

Note: I’ll also rename my Jupyter Notebook to “pandastutorial1”.

Import numpy and pandas to your Jupyter Notebook by running these two lines in a cell:

    import numpy as np
    import pandas as pd

Note: It’s conventional to refer to ‘pandas’ as ‘pd’. When you add the as pd at the end of your import statement, your Jupyter Notebook understands that from this point on, every time you type pd, you are actually referring to the pandas library.

Okay, now we have everything! Let’s start with this pandas tutorial!

The first question is:

How to open data files in pandas

You might have your data in .csv files or SQL tables. Maybe Excel files.
Or .tsv files. Or something else. But the goal is the same in all cases. If you want to analyze that data using pandas, the first step will be to read it into a data structure that’s compatible with pandas.

Pandas data structures

There are two types of data structures in pandas: Series and DataFrames.

Series: a pandas Series is a one-dimensional data structure (“a one-dimensional ndarray”) that can store values, and for every value it holds an index label, too.

DataFrame: a pandas DataFrame is a two-dimensional data structure, basically a table with rows and columns.
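To make the difference between the two structures concrete, here is a minimal sketch. The animal names and numbers are made up for illustration (they are not the tutorial’s data), and the variable names animals and zoo are assumptions.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
# (Values and index labels are hypothetical sample data.)
animals = pd.Series(["elephant", "tiger", "zebra"], index=[10, 11, 12])

# A DataFrame: a two-dimensional table with named columns.
zoo = pd.DataFrame({
    "animal": ["elephant", "tiger", "zebra"],
    "water_need": [500, 300, 200],
})

print(animals.ndim)         # number of dimensions of the Series
print(zoo.ndim)             # number of dimensions of the DataFrame
print(type(zoo["animal"]))  # a single column of a DataFrame is itself a Series
```

Looking up a value by its index label (for example animals.loc[11]) works the same way for a standalone Series and for a column taken out of a DataFrame.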
In this pandas tutorial, I’ll focus mostly on DataFrames. The reason is simple: most of the analytical methods I will talk about will make more sense in a 2D data table than in a 1D array.

Loading a .csv file into a pandas DataFrame

Okay, time to put things into practice!
Let’s load a .csv data file into pandas! There is a function for it, called read_csv.

Start with a simple demo data set, called zoo! This time, for the sake of practicing, you will create a .csv file for yourself!

Read the .csv directly from the server (using its URL)

Note 2: If you are wondering what’s in this data set: this is the data log of a travel blog. This is a log of one day only (if you are a participant, you will get much more of this data set in the last week of the course ;-)). I guess the names of the columns are fairly self-explanatory.
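If you don’t have the data file at hand, the following sketch shows how read_csv handles a semicolon-separated file without a header row. The three rows are made-up stand-ins for the travel blog log, read from an in-memory string instead of a file; only the column names come from this tutorial.

```python
import io
import pandas as pd

# Hypothetical sample rows: semicolon-separated, no header line.
raw = (
    "2018-01-01 00:01:01;read;country7;1001;SEO;North America\n"
    "2018-01-01 00:03:20;read;country2;1002;AdWords;Europe\n"
    "2018-01-01 00:04:01;read;country2;1003;SEO;Asia\n"
)

columns = ['mydatetime', 'event', 'country', 'userid', 'source', 'topic']

# delimiter=';' tells pandas how the fields are separated;
# names=... supplies the column names because the data has no header row.
sample = pd.read_csv(io.StringIO(raw), delimiter=';', names=columns)

print(sample)
```

With a real file, you would pass its path (or URL) as the first argument instead of the io.StringIO object.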
Selecting data from a dataframe in pandas

This is the first episode of this pandas tutorial series, so let’s start with a few very basic data selection methods. In the next episodes we will go deeper!

1) Print the whole dataframe

The most basic method is to print your whole data frame to your screen. Of course, you don’t have to run the pd.read_csv function again and again and again. Just store its output the first time you run it!

    articleread = pd.read_csv('pandastutorialread.csv', delimiter=';', names=['mydatetime', 'event', 'country', 'userid', 'source', 'topic'])

After that, you can call this articleread value anytime to print your DataFrame!

2) Print a sample of your dataframe

Sometimes, it’s handy not to print the whole dataframe and flood your screen with data. When a few lines are enough, you can print only the first 5 lines by typing:

    articleread.head()

Or the last few lines by typing:

    articleread.tail()

Or a few random lines by typing:

    articleread.sample(5)

3) Select specific columns of your dataframe

This one is a bit tricky! Let’s say you want to print the ‘country’ and the ‘userid’ columns only. You should use this syntax:

    articleread[['country', 'userid']]

Any guesses why we have to use double bracket frames? It seems a bit over-complicated, I admit, but maybe this will help you remember: the outer bracket frames tell pandas that you want to select columns, and the inner brackets are for the list (remember? Python lists go between bracket frames) of the column names.

By the way, if you change the order of the column names, the order of the returned columns will change, too:

    articleread[['userid', 'country']]

This is the DataFrame of your selected columns.

Note: Sometimes (especially in predictive analytics projects), you want to get Series objects instead of DataFrames. You can get a Series using either of these two syntaxes (and selecting only one column):

    articleread.userid
    articleread['userid']

The output is a Series object and not a DataFrame object.

4) Filter for specific values in your dataframe

If the previous one was a bit tricky, this one will be really tricky! Let’s say you want to see a list of only the users who came from the ‘SEO’ source. In this case you have to filter for the ‘SEO’ value in the ‘source’ column:

    articleread[articleread.source == 'SEO']

It’s worth it to understand how pandas thinks about data filtering:

STEP 1) First, between the bracket frames it evaluates every line: is the articleread.source column’s value 'SEO' or not? The results are boolean values (True or False).

STEP 2) Then from the articleread table, it prints every row where this value is True and doesn’t print any row where it’s False.

Does it look over-complicated? Perhaps, but this is the way it is, so let’s just learn it because you will use this a lot!

Functions can be used after each other

It’s very important to understand that pandas’s logic is very linear (compared to SQL, for instance).
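Before going further, the two filtering steps described above can be checked on a small table. This is only a sketch: the visits DataFrame and its four rows are made up for illustration, with column names borrowed from the tutorial’s data set.

```python
import pandas as pd

# Hypothetical mini version of the article log.
visits = pd.DataFrame({
    "country": ["country7", "country2", "country2", "country5"],
    "userid": [1001, 1002, 1003, 1004],
    "source": ["SEO", "AdWords", "SEO", "Reddit"],
})

# STEP 1: the comparison produces one boolean value per row.
mask = visits.source == "SEO"
print(mask.tolist())  # [True, False, True, False]

# STEP 2: indexing with that boolean Series keeps only the True rows.
seo_rows = visits[mask]
print(seo_rows)

# Double brackets return a DataFrame, single brackets return a Series.
two_columns = visits[["country", "userid"]]
one_column = visits["userid"]
print(type(two_columns).__name__, type(one_column).__name__)
```

Storing the mask in its own variable is optional; visits[visits.source == "SEO"] does the same thing in one line.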
So if you apply a function, you can always apply another one on it. In this case, the input of the latter function will always be the output of the previous function.

E.g. combine these two selection methods:

    articleread.head()[['country', 'userid']]

This line first selects the first 5 rows of our data set. And then it takes only the ‘country’ and the ‘userid’ columns.

Could you get the same result with a different chain of functions? Of course you can:

    articleread[['country', 'userid']].head()

In this version, you select the columns first, then take the first five rows. The result is the same; only the order of the functions (and the execution) is different.

One more thing.
What happens if you replace the ‘articleread’ value with the original read_csv function:

    pd.read_csv('pandastutorialread.csv', delimiter=';', names=['mydatetime', 'event', 'country', 'userid', 'source', 'topic'])[['country', 'userid']].head()

This will work, too, only it’s ugly (and inefficient). But it’s really important that you understand that working with pandas is nothing but applying the right functions and methods, one by one.

Test yourself!

As always, here’s a short assignment to test yourself! Solve it, so the content of this article can sink in better!

Select the userid, the country and the topic columns for the users who are from country2! Print the first five rows only!

Okay, go ahead and solve it!

And here’s my solution! It can be a one-liner:

    articleread[articleread.country == 'country2'][['userid', 'topic', 'country']].head()

Or, to be more transparent, you can break this into more lines:

    arfiltered = articleread[articleread.country == 'country2']
    arfilteredcols = arfiltered[['userid', 'topic', 'country']]
    arfilteredcols.head()

Either way, the logic is the same. First you take your original dataframe (articleread), then you filter for the rows where the country value is country2 (articleread.country == 'country2'), then you take the three columns that were required ([['userid', 'topic', 'country']]) and eventually you take the first five rows only (.head()).

Conclusion

You are done with the first episode of my pandas tutorial series!
In the next article, you can learn more about the different aggregation methods (e.g. sum, mean, max, min) and about grouping (so basically about segmentation).
Cheers,
Tomi Mester
    data = pd.read_csv('data.csv', skiprows=4)
    data

So, we have used the read_csv function, skipped the first four rows of the file, and then displayed the remaining rows. Run the cell and see the output. If the table is long, the notebook truncates the display; here it shows the first 30 and the last 30 rows. Our data file has more than 29,000 rows, which is why we see only the first and last 30 rows.

Import Matplotlib

We can import the Matplotlib library using the following code. Write the following code inside the next Jupyter Notebook cell.
    import matplotlib.pyplot as plt
    %matplotlib inline

Now, hit Ctrl + Enter and it will import the library. An IPython kernel works seamlessly with the matplotlib.pyplot library. You can see in the above code that we have used the %matplotlib inline magic command, which means that charts are displayed inside the Jupyter Notebook, right below the cell that produces them. Now let’s go through the chart types one by one.

Let’s plot a graph of the different sports that took part in the 2008 edition of the Olympics. We have already imported the matplotlib.pyplot library in the Notebook, so now we will use it to plot the graph of different sports.
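A side note on the skiprows=4 argument used when data.csv was loaded: its effect is easy to see on a small in-memory file. The sketch below is an assumption-laden stand-in, not the real Olympics file. The four preamble lines, the City column and all rows are made up; only the Edition and Sport column names appear in this tutorial.

```python
import io
import pandas as pd

# Hypothetical file: four descriptive lines, then the real header and data.
raw = (
    "Made-up medal list\n"
    "Source: hypothetical\n"
    "Rows: 3\n"
    "Generated for illustration only\n"
    "City,Edition,Sport\n"
    "Athens,2004,Aquatics\n"
    "Beijing,2008,Athletics\n"
    "Beijing,2008,Rowing\n"
)

# skiprows=4 drops the first four lines, so the fifth line becomes the header.
data = pd.read_csv(io.StringIO(raw), skiprows=4)

print(data.columns.tolist())
print(len(data))
```

Without skiprows, pandas would try to treat the first descriptive line as the header, and the table would not parse as intended.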
Plot a Line Chart using Matplotlib.pyplot Library

We will display the line chart. So let’s add the following code in the Jupyter Notebook.

    filteredData = data[data.Edition == 2008]
    filteredData.Sport.value_counts().plot()

In the above code, we first select the rows of the 2008 Olympics edition, then count how many rows each sport has, and plot a line graph of those counts. The output appears below the cell. By default, the plot function gives us a line chart.

Plot a Bar Chart using Matplotlib.pyplot Library

We can also display the bar chart instead of the line chart.
We need to pass the parameter kind with the value 'bar', and it will show the bar chart. See the following example. Write the following code in the cell.

    filteredData = data[data.Edition == 2008]
    filteredData.head()
    filteredData.Sport.value_counts().plot(kind='bar')

Here, we call the head function to look at the first five rows and then plot a bar chart of the number of rows per sport in the 2008 Olympics. Note that a notebook cell displays only the value of its last expression, so run filteredData.head() in a cell of its own if you want to see its output. The chart appears below the cell.

The above bar chart is a vertical bar chart. We can also get a horizontal plot using the following code.
    filteredData.Sport.value_counts().plot(kind='barh')

We have passed the kind='barh' parameter, and it gives us a horizontal bar chart.

Plot a Pie Chart using Matplotlib.pyplot Library

We can also display the pie chart instead of the bar chart. We need to pass the parameter kind with the value 'pie', and it will show the pie chart. See the following example.
Write the following code in the cell.

    filteredData = data[data.Edition == 2008]
    filteredData.head()
    filteredData.Sport.value_counts().plot(kind='pie')

The output appears below the cell. So, we have learned how to draw line, bar and pie charts using a real-world example in a Python Jupyter Notebook.

Finally, Python Matplotlib Tutorial With Example is over.
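As a recap, the whole sequence above (filter by edition, count rows per sport, plot with a chosen kind) can be tried without the Olympics file. The sketch below makes several assumptions: the seven rows are made up, the Agg backend is selected so the script also runs outside a notebook (in Jupyter you would use %matplotlib inline instead), and the output file name charts.png is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the Olympics data.
data = pd.DataFrame({
    "Edition": [2004, 2008, 2008, 2008, 2008, 2008, 2008],
    "Sport": ["Aquatics", "Athletics", "Athletics", "Aquatics",
              "Athletics", "Rowing", "Aquatics"],
})

# Keep only the 2008 rows, then count how many rows each sport has.
filteredData = data[data.Edition == 2008]
counts = filteredData.Sport.value_counts()
print(counts)

# Draw the same counts as a line chart (the default), a bar chart and a pie chart.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
counts.plot(ax=axes[0])
counts.plot(kind="bar", ax=axes[1])
counts.plot(kind="pie", ax=axes[2])
fig.savefig("charts.png")
```

value_counts returns the counts sorted from most to least frequent, which is why the bars appear in descending order.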