Python Pandas for Beginners: Data Manipulation and Analysis Guide with Q&A
import pandas as pd
data = [4,8,15,16,23,42]
A = pd.Series(data)
print(A)
0     4
1     8
2    15
3    16
4    23
5    42
dtype: int64
Q2. Create a list variable containing 10 elements, apply the pandas.Series function to it, and print the result.
Ans.
import pandas as pd
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
series = pd.Series(my_list)
print(series)
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
Q3. Create a Pandas DataFrame that contains the following data:
|Name | Age | Gender |
|Alice | 25 | Female |
|Bob | 30 | Male |
|Claire | 27 | Female |
Then, print the DataFrame.
Ans.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Claire'],
'Age': [25, 30, 27],
'Gender': ['Female', 'Male', 'Female']
}
df = pd.DataFrame(data)
print(df)
     Name  Age  Gender
0   Alice   25  Female
1     Bob   30    Male
2  Claire   27  Female
Q4. What is a ‘DataFrame’ in pandas, and how is it different from a pandas.Series? Explain with an example.
Ans.
In Pandas, a DataFrame is a two-dimensional labeled data structure that represents a tabular, spreadsheet-like data object. It consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean). Think of it as a table where each column represents a variable and each row represents an observation or entry.
On the other hand, a Pandas Series is a one-dimensional labeled array capable of holding any data type. It can be seen as a single column of a DataFrame or a single variable. Series can be created from various data structures like lists, arrays, or dictionaries.
Here's an example to illustrate the difference between a DataFrame and a Series:
import pandas as pd
# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Claire'],
'Age': [25, 30, 27],
'Gender': ['Female', 'Male', 'Female']
}
df = pd.DataFrame(data)
# Create a Series
ages = pd.Series([25, 30, 27])
print("DataFrame:")
print(df)
print("\nSeries:")
print(ages)
DataFrame:
     Name  Age  Gender
0   Alice   25  Female
1     Bob   30    Male
2  Claire   27  Female

Series:
0    25
1    30
2    27
dtype: int64
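The relationship between the two structures can also be seen by pulling a column out of a DataFrame. The following is a minimal sketch that reuses the sample data from the answer above; the variable names age_column and age_frame are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Claire'],
    'Age': [25, 30, 27],
    'Gender': ['Female', 'Male', 'Female'],
})

# Selecting one column with single brackets returns a Series
age_column = df['Age']
print(type(age_column).__name__)   # Series
print(age_column.ndim, df.ndim)    # 1 2

# Selecting with a list of column names keeps the result a DataFrame
age_frame = df[['Age']]
print(type(age_frame).__name__)    # DataFrame
print(age_frame.shape)             # (3, 1)
```

In other words, a DataFrame can be thought of as a collection of Series that share the same index.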
Q5. What are some common functions you can use to manipulate data in a Pandas DataFrame? Can you give an example of when you might use one of these functions?
Ans.
Pandas provides a wide range of functions to manipulate data in a DataFrame. Here are some commonly used functions along with an example scenario where you might use them:
head() and tail(): These functions allow you to view the first or last few rows of a DataFrame, respectively. They are useful for quickly inspecting the data. Example:
df.head() # View the first 5 rows of the DataFrame
df.tail(10) # View the last 10 rows of the DataFrame
info(): This function provides a summary of the DataFrame, including the column names, data types, and non-null counts. It is helpful for understanding the structure of the data. Example:
df.info() # Display summary information about the DataFrame
describe(): This function generates descriptive statistics for numerical columns in the DataFrame, such as count, mean, standard deviation, minimum, and maximum values. It gives a quick overview of the distribution of the data. Example:
df.describe() # Compute descriptive statistics of the DataFrame
sort_values(): This function allows you to sort the DataFrame based on one or more columns. It is useful for arranging the data in a specific order. Example:
sorted_df = df.sort_values('Age') # Sort the DataFrame by the 'Age' column
groupby(): This function enables grouping the data based on one or more columns and applying aggregate functions to the grouped data. It is useful for performing group-wise calculations and analysis. Example:
grouped_df = df.groupby('Gender')['Age'].mean() # Compute the average age by gender
drop(): This function allows you to remove rows or columns from the DataFrame. It is handy when you want to eliminate irrelevant or unnecessary data. Example:
cleaned_df = df.drop(['Column1', 'Column2'], axis=1) # Drop specified columns from the DataFrame
These are just a few examples of the many functions available in Pandas for data manipulation. The choice of function depends on the specific data manipulation task you want to perform, such as data exploration, cleaning, filtering, aggregation, or sorting.
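To see a few of these functions working together, here is a short sketch that applies them to the sample DataFrame used in the earlier answers. The variable names (oldest_first, mean_age, without_gender) are illustrative choices, not part of the Pandas API.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Claire'],
    'Age': [25, 30, 27],
    'Gender': ['Female', 'Male', 'Female'],
})

# Sort by age, oldest first
oldest_first = df.sort_values('Age', ascending=False)
print(oldest_first['Name'].tolist())   # ['Bob', 'Claire', 'Alice']

# Average age per gender
mean_age = df.groupby('Gender')['Age'].mean()
print(mean_age['Female'])              # 26.0
print(mean_age['Male'])                # 30.0

# Drop a column; drop() returns a new DataFrame and leaves df unchanged
without_gender = df.drop(columns=['Gender'])
print(list(without_gender.columns))    # ['Name', 'Age']
print(list(df.columns))                # ['Name', 'Age', 'Gender']
```

Note that sort_values() and drop() return new objects by default, so the result has to be assigned to a variable if you want to keep it.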
Q6. Which of the following is mutable in nature: Series, DataFrame, or Panel?
Ans.
In Pandas, Series and DataFrame are both mutable in nature. Panel was mutable as well, but it has been removed from current versions of Pandas.
- Series: A Pandas Series is mutable, meaning you can modify its elements after it has been created. You can change the values of specific elements by assigning new values to them or use various methods to modify the Series in place.
Example:
import pandas as pd
series = pd.Series([1, 2, 3, 4, 5])
series[2] = 10 # Modify the value at index 2
series[3] = series[3] * 2 # Perform a computation on the value at index 3
- DataFrame: Similarly, a Pandas DataFrame is mutable. You can modify its columns, add or remove rows or columns, change values, and perform various data manipulation operations on the DataFrame.
Example:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Claire'], 'Age': [25, 30, 27]}
df = pd.DataFrame(data)
df['Age'] = df['Age'] + 1 # Increment the 'Age' column by 1
df.loc[2, 'Age'] = 28 # Change the value in the 'Age' column for the row with index 2
- Panel: Panels were used in older versions of Pandas to represent three-dimensional data, and their contents could be modified in the same way. The class was deprecated in version 0.20 and removed in version 0.25. The suggested replacements are a DataFrame with a MultiIndex or a multi-dimensional array library such as xarray.
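Since Panel is not available in current Pandas, one common way to hold data with a third dimension is a DataFrame with a MultiIndex. The sketch below is only an illustration: the years and ages are hypothetical values invented for this example, not data from the questions above.

```python
import pandas as pd

# Hypothetical data: ages recorded for the same people in two different years
index = pd.MultiIndex.from_product(
    [[2022, 2023], ['Alice', 'Bob']],
    names=['Year', 'Name'],
)
panel_like = pd.DataFrame({'Age': [24, 29, 25, 30]}, index=index)

# Select every row for one year (a two-dimensional "slice")
print(panel_like.loc[2023])

# Select a single value by (Year, Name)
print(panel_like.loc[(2023, 'Bob'), 'Age'])   # 30

# The result is an ordinary DataFrame, so values can still be changed in place
panel_like.loc[(2022, 'Alice'), 'Age'] = 23
```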
Because Series and DataFrame can be modified in place, a change made through one variable is visible through every other variable that refers to the same object. If you need to preserve the original data, make a copy with the copy() method before modifying it.
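As a minimal sketch of that advice, the example below copies a DataFrame before changing it. The names original and working are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({'Name': ['Alice', 'Bob', 'Claire'], 'Age': [25, 30, 27]})

# copy() returns an independent DataFrame (a deep copy by default)
working = original.copy()
working.loc[2, 'Age'] = 28
working['Age'] = working['Age'] + 1

print(original['Age'].tolist())  # [25, 30, 27]
print(working['Age'].tolist())   # [26, 31, 29]
```

The original DataFrame keeps its values because all modifications were applied to the copy.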
Q7. Create a DataFrame using multiple Series. Explain with an example.
Ans.
To create a DataFrame using multiple Series, you can combine the Series as columns using the pd.concat() function or pass them directly as a dictionary to the pd.DataFrame() function. Here's an example:
import pandas as pd
# Create Series
name_series = pd.Series(['Alice', 'Bob', 'Claire'])
age_series = pd.Series([25, 30, 27])
gender_series = pd.Series(['Female', 'Male', 'Female'])
# Create DataFrame using pd.concat()
df_concat = pd.concat([name_series, age_series, gender_series], axis=1)
df_concat.columns = ['Name', 'Age', 'Gender']
print("DataFrame using pd.concat():")
print(df_concat)
# Create DataFrame using pd.DataFrame()
data = {
'Name': name_series,
'Age': age_series,
'Gender': gender_series
}
df_dict = pd.DataFrame(data)
print("\nDataFrame using pd.DataFrame():")
print(df_dict)
Output:
DataFrame using pd.concat():
     Name  Age  Gender
0   Alice   25  Female
1     Bob   30    Male
2  Claire   27  Female

DataFrame using pd.DataFrame():
     Name  Age  Gender
0   Alice   25  Female
1     Bob   30    Male
2  Claire   27  Female
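When creating a DataFrame from multiple Series, note that Pandas aligns the Series by index label rather than by position. Series of different lengths or with different indexes do not raise an error; the missing positions are filled with NaN. Check that the Series share the same index to avoid unexpected alignment issues.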
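The alignment behaviour can be checked with a small sketch. Here the ages Series is deliberately one entry short; the data is the same sample used above with one value left out for illustration.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Claire'], index=[0, 1, 2])
ages = pd.Series([25, 30], index=[0, 1])   # one entry short on purpose

df = pd.DataFrame({'Name': names, 'Age': ages})

# The row with no matching label in ages is filled with NaN
print(df['Age'].isna().tolist())   # [False, False, True]

# NaN is a floating-point value, so the integer column becomes float64
print(df['Age'].dtype)             # float64
```

If the Series carry different index labels but are meant to line up by position, calling reset_index(drop=True) on each one before combining them is one way to give them a matching index.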